Big data and predictive analytics: A systematic review of applications

Jamarani, Amirhossein; Haddadi, Saeid; Sarvizadeh, Raheleh; Haghi Kashani, Mostafa; Akbari, Mohammad; Moradi, Saeed

doi:10.1007/s10462-024-10811-5


Open access
Published: 17 June 2024

Volume 57, article number 176 (2024)

Artificial Intelligence Review

Amirhossein Jamarani¹,
Saeid Haddadi²,
Raheleh Sarvizadeh³,
Mostafa Haghi Kashani³,
Mohammad Akbari⁴ &
Saeed Moradi⁵


Abstract

Big data involves processing vast amounts of data using advanced techniques. Its potential is harnessed for predictive analytics, a sophisticated branch of analytics that anticipates unknown future events by discerning patterns in historical data. Various techniques drawn from modeling, data mining, statistics, artificial intelligence, and machine learning are employed to analyze the available history and extract discriminative patterns for predictors. This study aims to analyze the main research approaches to Big Data Predictive Analytics (BDPA) based on articles published from 2014 to 2023. In this article, we concentrate on predictive analytics using big data mining techniques and perform a Systematic Literature Review (SLR) of 109 articles. Based on the application and content of current studies, we introduce a taxonomy comprising seven major categories: industrial, e-commerce, smart healthcare, smart agriculture, smart city, Information and Communications Technologies (ICT), and weather. The benefits and weaknesses of each approach, potentially important changes, open issues, and future paths are discussed. The compiled SLR not only elaborates on BDPA’s strengths, open issues, and future work but also identifies the need to optimize insufficiently addressed metrics in big data applications, such as timeliness, accuracy, and scalability; meeting this need would enable organizations to use big data to shift from retrospective analytics to prospective, predictive analytics.


1 Introduction

Big data analytics refers to the various techniques used to analyze data, extract information, and gain insights from large-scale datasets whose complex patterns cannot easily be handled by conventional data-processing approaches. Data with a large number of samples offer higher statistical power, whereas overly complex data with a high-dimensional feature space may lead to a higher rate of false discovery (Breur 2016). Data capture, data storage, search, visualization, transfer, sharing, querying, updating, data analysis, information privacy, and data sourcing are the most important challenges in big data (Anagnostopoulos et al. 2016). Recent usage of the term big data tends to refer to user behavior analytics, predictive analytics, or other advanced data analytic methods that obtain value from data, and rarely to a data set of a particular size. In essence, the quantities of data now available are huge; however, size is not the most relevant feature of this data ecosystem (Rodríguez-Mazahua et al. 2016). Similarly, medical experts, researchers, business people, advertisers, and governments regularly face challenges in dealing with large data sets in urban informatics, business informatics (Bhuimali et al. 2018), fintech (Thakuriah et al. 2017), and web surfing. Scientists face constraints in e-science activities, including meteorology (Fathi et al. 2021), biology (Gharajeh 2018), genomics (Wong 2016), complicated physics simulations, and environmental studies. The big data ecosystem is commonly characterized by five attributes: Value, Veracity, Velocity, Variety, and Volume.

Value is the worth of the information extracted from the data. Data has no use or importance in itself; it must be transformed into something valuable from which information can be extracted. Veracity describes the quality and trustworthiness of the data. It strongly affects the quality of the captured data and the accuracy of the analysis. Velocity specifies the pace at which data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Variety describes the type and nature of the data. It helps those who analyze the data to use the resulting insight effectively. Lastly, Volume denotes the amount of generated and stored data. The size of the data determines its value and potential insight, and whether or not it can be regarded as big data (Fathi et al. 2021).
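The earlier remark that a high-dimensional feature space can raise the rate of false discovery can be illustrated with a small calculation. The sketch below is only an illustration under simplifying assumptions that do not come from the cited work: every feature is assumed to have no real effect, the tests are assumed independent, and each is run at a hypothetical significance level `alpha`. Under those assumptions, the probability of at least one spurious finding among m features is 1 - (1 - alpha)^m.

```python
def prob_at_least_one_false_positive(num_features: int, alpha: float = 0.05) -> float:
    """Chance of one or more false positives among independent null tests.

    Assumes every feature is truly unrelated to the target and that the
    tests are independent; both are simplifications for illustration.
    """
    return 1.0 - (1.0 - alpha) ** num_features


# The chance of a spurious "discovery" grows quickly with dimensionality.
for m in (1, 10, 100, 1000):
    print(m, round(prob_at_least_one_false_positive(m), 4))
```

This quantity is the family-wise error probability rather than the false discovery rate itself, but it shows the same tendency: the more features are screened, the more likely it is that some pattern appears by chance alone.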

With the emergence of big data systems, predictive analytics has gained prominence. Enterprises hold ever larger data pools in big data platforms, which has increased the opportunities for data mining to obtain predictive insights (Mohamed et al. 2020). The commercialization of machine learning tools has also accelerated this trend, resulting in growing demand for predictive analytic services (Casado and Younas 2015). Predictive analytics uses a wide range of techniques to help organizations forecast outcomes, and these techniques continue to develop with the expanding adoption of big data analytics. Big Data Predictive Analytics (BDPA) denotes frameworks and systems that gather, analyze, and interpret data of great variety, volume, velocity, veracity, and value in order to reveal patterns, trends, and relationships within the data, identify challenges and opportunities, foresee future events, and guide decision making in application contexts.

Predictive analytics draws on statistical methods from predictive modeling, machine learning, and data mining, which examine recent and historical facts to foresee future or otherwise unknown events (Nyce 2007). In the business domain, predictive models exploit the patterns found in transactional and historical data to identify opportunities and risks. Models capture relationships among many factors, allowing the risk or potential associated with a particular set of conditions to be assessed in order to guide decision-making for candidate transactions (Coker and Pulse 2014). The functional effect of these technical methods is that predictive analytics provides an estimated score (a probability) for each individual, such as a customer, an employee, a healthcare patient, or a vehicle, in order to inform, determine, or influence organizational processes that involve large numbers of individuals in healthcare, manufacturing, fraud detection, marketing, and the like. Predictive analytics is used in healthcare (Etemadi et al. 2023), smart cities (Karimi et al. 2021), marketing (Miles et al. 2017), retail (Huang et al. 2019), actuarial science (Homer et al. 2017), social networking (Failed 2017a), financial services (Ouahilal et al. 2016), insurance (Longhi and Nanni 2019), telecommunication (Failed 2018a), mobility (Moreira-Matias et al. 2016), travel (Amirian et al. 2016), child protection (Russell 2015), pharmaceuticals (Sohrabi 2019), capacity planning (Delfmann et al. 2019), and other fields. This Systematic Literature Review (SLR) is organized with the central aim of identifying, taxonomically classifying, and systematically comparing the current articles that concentrate on the planning, execution, and validation of big data analytic approaches. In line with these aims, we seek to answer the following research questions: What are the fields of application of predictive analytics in big data? What are the evaluation metrics of predictive analytics using big data? What evaluation methods are used in BDPA? What are the tools and environments used in BDPA? And what are the challenges and future issues of BDPA?
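The per-individual probability score described above can be made concrete with a small example. The following is a minimal sketch, not a method taken from the reviewed studies: it assumes logistic regression as the scoring model, fits it with plain batch gradient descent, and uses an invented toy history of six customers. The function names, feature choices, and data are all hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_score(features, weights, bias):
    """Return the model's probability estimate for one individual."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(rows, labels):
            error = predict_score(features, weights, bias) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical history: [past purchases, years since last visit] per customer,
# with label 1 if the customer bought again and 0 otherwise.
history = [[1, 0.2], [2, 0.1], [0, 0.9], [0, 0.8], [3, 0.3], [0, 1.0]]
bought_again = [1, 1, 0, 0, 1, 0]

weights, bias = fit_logistic(history, bought_again)
for customer in ([3, 0.1], [0, 1.0]):
    print(customer, round(predict_score(customer, weights, bias), 3))
```

The first customer resembles the past repeat buyers and receives a score above 0.5, while the second resembles the lapsed customers and receives a score below 0.5. In a real setting, a score of this kind would be computed for every individual and fed into the organizational process it is meant to inform.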

We followed the guidelines of Kitchenham (2004) and Brereton et al. (2007). Our main goal is a systematic identification and classification of the current validated achievements in BDPA. To date, according to our investigation, only a minority of SLRs have examined BDPA thoroughly, and none of them has presented a complete and precise review of the field. Furthermore, since BDPA is a highly critical and sensitive field, a comprehensive study is necessary. We therefore studied 109 articles to provide an exhaustive systematic review of predictive analytic methods that exploit big data for better performance. Although there are some reviews of various big data approaches, the vast majority do not focus on the challenges, open issues, benefits, and drawbacks of predictive analytics. To address this shortcoming, we present a systematic review and overview of the literature on predictive analytics using big data. This article helps researchers gain an overall understanding of the various approaches that utilize big data for predictive analytics. The key contributions of this study are summarized as follows:

Presenting a systematic review of predictive analytic approaches that utilize big data
Providing a technical and comprehensive taxonomy that categorizes the various applications of BDPA, as depicted in Fig. 1
Making a detailed comparison of the applied evaluation metrics and methods, the tools, and the pros and cons of each study
Determining open challenges and future trends of predictive analytics applied to big data

The remainder of this article is structured as follows. Section 2 considers relevant review works. Section 3 presents the research methodology, including the research questions and the article selection process. Section 4 provides a classification of big data predictive analytics applications. Section 5 presents the comparisons and results of the reviewed articles. Section 6 addresses open issues and future trends. Finally, Section 7 presents the conclusion and limitations.

2 Related work

Various review articles have been published in the field of BDPA. In this section, some of these articles are reviewed and compared with our work.

Philip Chen and Zhang (2014) surveyed big data applications, big data opportunities, and the state-of-the-art methods and technologies being adopted to tackle the challenges of big data. They discussed techniques for addressing the pitfalls of big data, such as cloud computing, quantum computing, and biological computing. However, the article selection procedure was not provided, and the review was not systematic.

Kumaresan and Rajakumar (2015) provided details on predictive analytics and discussed the scope of the area. The authors introduced several tools and techniques in predictive analytics. They examined a list of reviews to ascertain how researchers utilized predictive analytics in industrial and medical contexts, referencing various techniques and approaches in the process. This research discussed various issues and challenges of predictive analytics, available tools, applications, and modeling techniques in big data. However, no guidelines for future research were suggested, recently published articles were not considered, no taxonomy was prepared, the article selection process was not transparent, and the study was not an SLR.

Banumathi and Aloysius (2017) reviewed different predictive analytic applications and approaches. Analytic methods were considered from differing perspectives based on applications and data variety. The applications discussed include big data in health care, hotel governance, consumer orientations, higher education, and data e-governance. The authors presented predictive approaches adapted for different applications, together with challenges and suggestions. The article specifically identified the main applications that depend heavily on BDPA solutions and have already established themselves as big data entities. Nevertheless, this study was not an SLR, and the article selection process was not clear.

Poornima and Pushpalatha (2018) introduced the idea of using predictive analytics and data mining methods on various medical datasets to foresee different illnesses, covering the advantages, disadvantages, and accuracy levels related to future approaches to big data. The list of reviewed works was presented and discussed to show how different authors applied predictive analytics in medicine and business. The algorithms and techniques applied to big data were also described. However, this research was not an SLR, and no taxonomy was prepared. In addition, the article selection process was not clear, and possible future studies were not presented. Ghani et al. (2019) surveyed social media big data analytics from different angles. The authors arranged the survey based on different features. They discussed the applications of social media big data analytics by drawing methods and quality tokens from various studies. Open research challenges and future work in big data analytics were introduced, but the authors did not state the years covered by the reviewed articles, and the study was not an SLR.

Kaffash et al. (2021) reviewed big data algorithms and applications in intelligent transportation. In this study, no taxonomy was organized. Mallika and Selvamuthukumaran (2022) reviewed prospective precision medicine utilizing big data. They illustrated the most frequently used tools and computational platforms on which big-data-based precision medicine is founded; however, their review had neither a taxonomy nor an article collection process. Nobanee et al. (2022) reviewed only the applications of big data in the area of credit risk assessment. The authors described the notion and types of credit risk and then related big data to credit risk management. However, the weaknesses of their work were its limited scope and the limited number of reviewed articles.

Ikegwu et al. (2022) reviewed big data analytics in data-driven industries, covering tools, data sources, challenges, solutions, and research directions. The authors discussed different classification methods, data characteristics, and real-life applications across sectors. The study reviewed only works related to big data analytics that were published between 2013 and 2021, and it was not a systematic literature review. Himeur et al. (2023) provided an overview of the shortcomings of building automation and management systems (BAMSs) in terms of performance evaluation, energy consumption analysis, and security. The authors reviewed various AI-based tasks, presented existing frameworks, and discussed challenges of BAMS performance in intelligent buildings.

Cichosz et al. (2016) introduced predictive models for screening and managing the long- and short-term complications of diabetes. The authors also presented a systematic mapping study (SMS). These models were created to manage diabetes and its related problems, and the number of studies on them has risen tremendously in recent years. The prediction models were typically developed with linear or multiple logistic regression, probably because of its transparent functionality. Finally, to prove their usefulness, prediction models have to demonstrate their impact; in other words, their application should yield better outcomes for patients. Despite all the effort invested in building these predictive models, a considerable scarcity of impact studies was observed. However, this study was not a systematic review, and the method of selecting articles was unclear. Recently published articles were also excluded.

To reach a high-level understanding of big data in manufacturing, O’Donovan et al. (2015) provided an SMS. Their contributions were reports on the current state of work on big data approaches in manufacturing, such as the research methods considered, the manufacturing sectors on which big data exploration concentrated, and the results of big data research projects. The authors classified their study based on different research questions and answers. Nonetheless, their study did not provide any information regarding the years covered by the studied articles.

Muthukrishnan et al. (2017) classified different predictive models applied to monitor and improve the performance of students in educational settings such as schools or universities. Within the educational data mining methodology, the whole area was analyzed, two databases were selected, and a systematic mapping study was conducted. The main aim of this systematic mapping study was to examine the current predictive analytic models within the educational environment of schools and other educational institutions. Motivated by the need to understand the functional applications linked to approaches in healthcare, Mehta et al. (2019) provided an SMS on artificial intelligence with big data. To examine the progress in this field, the authors employed bubble plots to map the distribution of publications. They categorized the reviewed articles into different sub-classes, which led to the creation of a taxonomy; however, they did not review recently published articles.

Rahman and Reza (2020) reviewed the non-functional requirements (NFRs) in big data. Afterward, they implemented a model to map the NFRs. The authors showed that metrics such as performance, scalability, and reliability are the most important factors in data-intensive systems. Biesialska et al. (2021) conducted an SMS to review agile software development under the impact of big data. They collected articles by snowballing and by manual search of the databases. In addition, the authors examined the applications, company names, countries, and per-industry usage reported in the reviewed articles. Montero et al. (2021) took a systematic mapping approach to review big data quality models. The authors gave an overview of their article selection process, but they did not provide directions for future work on big data quality models.

Ohiomah et al. (2017) presented an information systems literature review on BDPA to identify BDPA areas that had been investigated before but still needed greater focus. They suggested specific research questions for further study and found that the arrival of big data shifted the role of predictive analytics from activities such as theory generation and validation to the more data-driven discovery of complicated patterns and relationships among variables and the evaluation of the probability that relationships occur among the variables of a dataset. In this research, recently published articles were ignored. Mikalef et al. (2018) conducted an SLR on big data analytics to explain how such analytics should be leveraged to contribute to competitive productivity. They reviewed research frameworks based on IT–business value, alongside segments of strategic management. The authors focused on tools, technical methods, network analytics, and the infrastructure of big data analysis. Despite this, their review did not cover recently published articles.

Kolajo et al. (2019) presented the flow of big data evaluation through a systematic review in order to identify the tools and approaches. However, the authors did not provide a taxonomy, and recently published articles were not included. Al-Sai et al. (2020) provided an SLR that divided the schema and framework of big data critical success factors into five major groups, namely individuals, management, approaches, authorities, and companies. By answering three research questions in their survey, the authors tried to provide solutions to the key issues of big data analytics. Nonetheless, they did not cover recently published articles to provide a more up-to-date SLR.

Rathore et al. (2021) discussed the factors influencing digital twinning. They identified research challenges and deficiencies that need to be addressed in the future for digital twinning to excel. The authors also divided their article into different sections covering the scope of digital twinning in manufacturing, medicine, transportation, education, business, and other industries. Naghib et al. (2022) provided an SLR on methods for managing big data in the Internet of Things (IoT). They organized the articles into four distinct categories: processes related to big data management (BDM), the BDM framework, quality attributes, and big data analytics. Georgiadis and Poels (2022) concluded that, although there have been numerous studies on big data security assessment, there is still room for more pertinent and methodological rules to lower data protection risks in systems that store big data analytic algorithms.

Acciarini et al. (2023) reviewed the benefits of business model innovation with big data to help companies reach a comprehensive understanding of its diverse applications. The authors offered guidance on harnessing the potential of big data in industry. Shah et al. (2023) provided an SLR on the applications of BDPA in Supply Chain Risk Management (SCRM). The authors analyzed 68 selected articles and categorized them by publication year, country, journal, application area, and tools used. Singh et al. (2023) reviewed the prospective advantages and challenges of big data analytics (BDA) in the healthcare industry. The authors highlighted the increasing adoption of BDA in healthcare while addressing the associated challenges. The article was based on an SLR and offered possible solutions for healthcare challenges.

The reviewed studies are divided into three categories: survey, systematic mapping study (SMS), and SLR, as shown in Table 1. Considering the previous points, none of the SLRs (Ohiomah et al. 2017; Mikalef et al. 2018; Kolajo et al. 2019; Al-Sai et al. 2020; Rathore et al. 2021; Naghib et al. 2022; Georgiadis and Poels 2022) has reviewed BDPA holistically. Ohiomah et al. (2017) only reviewed articles published between 2006 and 2017. Kolajo et al. (2019) concentrated on big data stream analysis, covering the years between 2004 and 2018. Al-Sai et al. (2020) reviewed the critical success factors of big data between 2007 and 2019 and therefore lack studies published after 2019; including more recent studies would have helped ensure that the findings were up to date. The only article that is relatively close to our work is Mikalef et al. (2018), in which the authors only reviewed articles until 2018. We therefore state that our SLR is the first to investigate BDPA thoroughly up to 2023.

Table 1 Related studies in the field of BDPA

(“big data” <OR> “large data” <OR> Hadoop <OR> Spark <OR> Storm)
<AND>
(predictive <OR> forecasting <OR> prediction <OR> foresee)
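The Boolean search string above combines two groups of terms: a record qualifies when it mentions at least one big data term and at least one prediction term. The sketch below shows one way to apply the same logic to a local list of titles. It is only an illustration: the article does not say how each database evaluates the query, and the function names, the case-insensitive whole-word matching, and the sample titles are assumptions made here.

```python
import re

# Term groups copied from the search string; matching is case-insensitive.
BIG_DATA_TERMS = ["big data", "large data", "hadoop", "spark", "storm"]
PREDICTION_TERMS = ["predictive", "forecasting", "prediction", "foresee"]


def contains_any(text: str, terms) -> bool:
    """True if any term occurs in the text as a whole word or phrase."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms)


def matches_search_string(text: str) -> bool:
    """Apply (big data group) AND (prediction group) to one title or abstract."""
    return contains_any(text, BIG_DATA_TERMS) and contains_any(text, PREDICTION_TERMS)


# Hypothetical titles used only to exercise the filter.
titles = [
    "Predictive maintenance on Hadoop clusters",
    "Big data storage architectures",
    "Forecasting rainfall with linear regression",
    "Apache Spark for demand forecasting",
]
for title in titles:
    print(matches_search_string(title), title)
```

Whole-word matching is a deliberate simplification: a title containing only "predictions" or "forecast" would not match, whereas a database with stemming or wildcard support might return it.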


