Introduction

Over the past 15 years, Big Data has emerged as a foundational pillar supporting a wide range of scientific fields, from medicine and healthcare [1] to engineering [2], finance and marketing [3,4,5], politics [6], social network analysis [7, 8], and telecommunications [9], to cite only a few examples. This 15-year period has witnessed a significant increase in research efforts aimed at unraveling the major problems in Big Data, with an almost innumerable array of potential solutions and data sources [10,11,12,13]. The result is a vast body of scientific papers that has ultimately demonstrated the twofold, ambivalent nature of Big Data. On the one hand, it has confirmed the pivotal role played by this scientific field in shaping the technological advancements of our time. On the other hand, an approach to understanding Big Data based on tens of thousands of technical papers, each specializing in its own sector, however natural it might seem, has become unsustainable because it has often led researchers to confuse the theory of Big Data with its practice or use. There have also been numerous attempts to describe the general landscape of Big Data through survey papers. Nonetheless, given the vastness of the subject, the majority of them did not escape the trap of pre-formed models and tried to respond, as closely as possible, to the concrete requirements of just one sub-field or to the point of view of a few perspectives. In this complex context, to take at least one step further in the knowledge of the state of the art of Big Data research over the above-mentioned period, we decided to conduct a different form of comprehensive exploration, one not biased by the specificity of given sectors or confounded by single technical perspectives.
To do that, we have adopted the methodology termed systematic literature review (SLR), as proposed by Kitchenham and Charters [14] in the field of software engineering [15, 16]. Although SLR proceeds through a set of well-defined steps, also in this case, an initial choice has to be made regarding the most crucial parameters through which the subject of investigation should be explored. In the case of Big Data, our primary focus has been on gaining insights into the principal application domains of Big Data, unraveling the major challenges and limitations encountered by researchers in the analysis of the typically enormous datasets they manage, and unveiling the emerging trends and directions in future Big Data research.

Guided by the structured methodology imposed by SLR, we started with three research questions that matched the points raised above: essentially, (i) most common application domains, (ii) current research challenges and limitations, and (iii) emerging future trends and directions. From this point on, we followed the SLR steps. First, we translated the three research questions into specific search terms, with which five digital libraries were searched, namely Scopus, IEEE Xplore, ACM Digital Library, SpringerLink, and Google Scholar. Upon completion of the search activity (detailed in the following section of this paper), 189 primary studies that matched our generic search criteria were identified. Of these 189 studies, only 32 were actually reviews. Since the target of our study was to provide a panoramic view of this 15-year Big Data research period, (a) shedding light on the prevalent application domains, (b) highlighting the hurdles faced by researchers, and (c) finally outlining the potential trajectories for future research, we focused on the analysis of just these 32 survey studies.

With this paper, we do not intend to conduct a traditional literature review on a topic as extensive as Big Data. Traditional scientific surveys can include many more studies and corresponding papers, and they are mainly built with an eye toward generalizability and inclusion rather than selectivity and relevance. As a consequence, those approaches often yield little more than a summary of the topic of interest. SLRs, instead, start from the legitimate ambition of being more than a summary of a topic. In essence, they distinguish themselves from ordinary surveys of the available literature because, beyond identifying all publications on a topic, they also require the following activities: explicit formulation of a search objective, identification and description of a search procedure, definition of criteria for inclusion and exclusion of publications, literature selection, and information extraction based only on a transparent evaluation of the quality of publications. Moreover, an SLR should provide insightful information on the current state of research on a topic, starting from a given set of research questions and following a formal methodological procedure designed to reduce distortions caused by an overly generous or overly restrictive selection of the literature, while guaranteeing the reliability of the selected publications. Hence, to pursue these objectives, an SLR should start with the definition of the criteria for determining what should be included or excluded before conducting the search. In addition, an SLR should typically be performed mainly using electronic literature databases. It should also be noted that such a structured approach should document all the information gathered (and the steps taken as part of this process), with the aim of making the paper selection process completely visible and reproducible [17].

In the end, we know very well that a point-to-point analysis of the set of almost 320 papers from which we have started our SLR could have brought more (generic) information than that provided by the circa 30 papers finally selected by our SLR. Nonetheless, it is highly likely that this information would have been somewhat redundant, more prone to defects and personal biases, and finally, also more boring to read.

With this SLR, we aim to contribute, in a focused and structured way, to Big Data research in several ways: first, we provide researchers with a clear picture of how Big Data application domains have changed over time; second, we highlight the challenges faced by academia and industry and their evolution over the last 15 years; finally, we sketch a set of open points that researchers should take into consideration in the near future.

We can conclude that, while our collective understanding of Big Data has grown after this investigation, this analysis has underscored once more that research interests in this field show a kind of optimal stability, reflected in their even distribution among application domains, challenges, and future trends. On the one hand, we observe a pervasive adoption of Big Data solutions in all everyday life domains (such as Energy [18], Smart Cities [9], and Healthcare [19]). On the other hand, researchers have spent considerable effort on managing data quality, designing and developing advanced frameworks to manage Big Data in real time, and focusing on security and privacy. However, many challenges remain open before Big Data can be seamlessly integrated into the data-driven advanced software solutions of the future, such as mitigating energy consumption, optimizing algorithms, increasing framework security with a privacy and ethical focus, intersecting Artificial Intelligence and Machine Learning technologies, opening data sets, improving interoperability among different stakeholders, and considering societal and business changes.

The remainder of this paper is organized as follows. In Section "Research method", we apply the SLR methodology to our Big Data use case (with the definition of our research questions, the search strategy, the inclusion/exclusion criteria, the study quality assessment questionnaire, and the data extraction from primary studies), with the dual aim of explaining the abstract methodology and its application in our field. Section "SLR: implementation" describes how we conducted the review and the results obtained in each stage and step of our SLR; Section "SLR: results" shows our findings, briefly summarizing each of our selected primary studies; Section "Discussion" critically discusses the findings that drew special attention during our analysis; Section "Threats to validity" discusses the possible threats to the validity of our study; Section "Conclusion" presents the conclusions we drew from our SLR.

A taxonomy of key concepts for Big Data evolution over the last 15 years is presented in Fig. 1.

Fig. 1 Taxonomy of Big Data evolution over the last 15 years

Research method

Research questions

This SLR has been conducted following the procedure defined by Kitchenham and Charters. As such, in the first step, we defined the research questions (RQ) that will drive the entire review methodology.

As we define the research questions that will guide our SLR, it is crucial to establish a balance between the breadth and depth of our investigation. After careful consideration, and to ensure that our review maintains a focused and meaningful scope, we decided to narrow down our research questions to the following three:

  • RQ1: what are the most common application domains for Big Data analytics, and how have they evolved over time?

  • RQ2: what are the major challenges and limitations that researchers have encountered in Big Data analysis, and how have they been addressed?

  • RQ3: what are the emerging research trends and directions in Big Data that will likely shape the field in the next 5 to 10 years?

Search strategy

An SLR begins by looking for relevant studies related to the research questions. To do this, we identified appropriate search terms using the method outlined by Kitchenham and Charters, which suggests considering three aspects: Population (P), Intervention (I), and Outcomes (O).

We identified the following relevant search terms for each aspect in our review:

  • Population: Big Data, real-time data analytics, large datasets.

  • Intervention: methodologies, techniques, domains, architectures, solutions.

  • Outcomes: research trends, future directions, emerging technologies, challenges, SLR, Systematic Literature Review.

The search string was constructed as follows:

$$\begin{aligned} (P_1 \ \text{OR} \ P_2 \ \dots \ \text{OR} \ P_n) \ \text{AND} \ (I_1 \ \text{OR} \ I_2 \ \dots \ \text{OR} \ I_n) \ \text{AND} \ (O_1 \ \text{OR} \ O_2 \ \dots \ \text{OR} \ O_n) \end{aligned}$$

P refers to population terms, I refers to intervention terms and O refers to outcome terms, all of which are connected through boolean operators AND and OR.

A search string may take a form like the following:

(“big data” OR “real-time data analytics” OR “large datasets”) AND (“methodologies” OR “techniques” OR “domains” OR “architectures” OR “solutions”) AND (“research trends” OR “future directions” OR “emerging technologies” OR “challenges” OR “SLR” OR “Systematic Literature Review”)
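The construction above can be sketched in a few lines of Python. This is only an illustration of the P/I/O pattern, not the tooling used in the review: the function name `build_search_string` is hypothetical, and straight quotes are used in place of typographic ones because most library search forms expect them.

```python
def build_search_string(population, intervention, outcomes):
    """Join the terms of each group with OR, then join the three groups with AND."""
    def group(terms):
        return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"

    return " AND ".join(group(g) for g in (population, intervention, outcomes))


# Search terms as listed for each aspect in the text.
population = ["big data", "real-time data analytics", "large datasets"]
intervention = ["methodologies", "techniques", "domains", "architectures", "solutions"]
outcomes = [
    "research trends",
    "future directions",
    "emerging technologies",
    "challenges",
    "SLR",
    "Systematic Literature Review",
]

search_string = build_search_string(population, intervention, outcomes)
print(search_string)
```

Keeping the three term lists separate makes it easy to add or drop a synonym in one aspect without rewriting the whole string; each digital library may still require small syntax adjustments.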

Since we need to find and study primary studies related to our research questions, selecting appropriate digital libraries and search engines is essential. For this reason, we decided to use the following well-established sources:

  • Scopus: a multidisciplinary database that covers a broad range of research fields.

  • IEEE Xplore: an invaluable resource for technology and engineering-related SLR.

  • ACM Digital Library: a comprehensive collection of relevant articles, conference papers, and journals focused on computer science and information technology.

  • SpringerLink: an extensive collection of academic articles in the fields that align closely with our research interests.

  • Google Scholar: a freely accessible web search engine that indexes scholarly literature across various disciplines.

By using these sources, we aim to ensure a comprehensive and focused literature search, and thereby a thorough and methodical review.

Inclusion/Exclusion criteria

In this stage of the SLR, we need to make an accurate selection of the studies extracted. To do this, we must define some rigorous inclusion/exclusion criteria, to decide which studies are going to be useful for our purpose. To achieve this, studies were excluded based on the following criteria:

  1. Studies published before the 15-year time frame.

  2. Studies in languages other than English.

  3. Non-academic sources, including blogs, news articles, marketing materials, and reports from non-academic organizations.

  4. Studies that are only marginally related to Big Data or the specific topics within our research questions.

In conclusion, all those studies that are not cut off by the exclusion criteria above are to be considered as included. They are called “Primary Studies” (PS).
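As a rough illustration, the four exclusion criteria can be expressed as a single screening function. The sketch below is an assumption-laden example rather than the procedure actually used: the `Record` fields, the `exclusion_reason` helper, and the `FIRST_YEAR` value are all hypothetical, and relevance to the research questions is a reviewer's judgement that is simply recorded as a flag here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    title: str
    year: int
    language: str
    academic: bool  # False for blogs, news, marketing material, non-academic reports
    on_topic: bool  # reviewer's judgement: more than marginally related to the RQs


def exclusion_reason(rec: Record, first_year: int) -> Optional[str]:
    """Return the first exclusion criterion that applies, or None if the record is kept."""
    if rec.year < first_year:
        return "published before the time frame"
    if rec.language.lower() != "english":
        return "not in English"
    if not rec.academic:
        return "non-academic source"
    if not rec.on_topic:
        return "only marginally related"
    return None


# Made-up records for illustration only; FIRST_YEAR is a placeholder,
# since the text specifies a 15-year window without naming its first year.
FIRST_YEAR = 2010
candidates = [
    Record("Survey A", 2018, "English", True, True),
    Record("Old study", 2005, "English", True, True),
    Record("Vendor white paper", 2020, "English", False, True),
]
primary_studies = [r for r in candidates if exclusion_reason(r, FIRST_YEAR) is None]
print([r.title for r in primary_studies])
```

Returning the reason rather than a bare boolean keeps a trace of why each study was dropped, which supports the reproducibility goal of an SLR.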

Study quality assessment

Kitchenham and Charters stress the necessity of assessing the quality of primary studies to reduce bias and enhance the validity of the evaluation process. In our research, we employ a study quality assessment to make sure that we retain only the most relevant results.

To achieve this, we formulated a five-question study quality questionnaire, which serves as the foundation for assessing the quality of the primary studies:

  • QA1: has the primary study established a well-defined research objective?

  • QA2: did the primary study comprehensively describe its research methods and data sources?

  • QA3: has the technique or approach undergone a trustworthy validation?

  • QA4: has the primary study effectively identified and discussed the significant challenges and limitations encountered in Big Data analysis?

  • QA5: are the findings, research trends, and directions clearly presented and directly connected to the study’s objectives or goals?

Hence, we applied the formulated questionnaire to the included PSs to assess their quality. The output of this SLR stage will be discussed in Section 4.
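A minimal sketch of how the answers to the questionnaire might be recorded and counted for one study is given below. It assumes the three answer values used in the results ("Yes", "No", and "na"); the `tally` helper and the example answers are hypothetical, and no aggregate quality score is implied, since the text does not define one.

```python
from collections import Counter

QUESTIONS = ("QA1", "QA2", "QA3", "QA4", "QA5")
ALLOWED = {"Yes", "No", "na"}


def tally(answers: dict) -> Counter:
    """Validate one study's answers and count how many are Yes, No, and na."""
    for question in QUESTIONS:
        if question not in answers:
            raise ValueError(f"missing answer for {question}")
        if answers[question] not in ALLOWED:
            raise ValueError(f"invalid answer for {question}: {answers[question]!r}")
    return Counter(answers[q] for q in QUESTIONS)


# Hypothetical answers for a single study, not taken from the paper's tables.
example = {"QA1": "Yes", "QA2": "Yes", "QA3": "na", "QA4": "No", "QA5": "Yes"}
counts = tally(example)
print(counts["Yes"], counts["No"], counts["na"])
```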

Data extraction

The data extraction process entails gathering relevant information from the chosen primary studies to address the research questions. To facilitate this process, we have created a dedicated data extraction form, as shown in Table 1. As suggested in Kitchenham and Charters, we used the test-retest process to check the consistency and accuracy of the extracted data with respect to the original sources. After finishing the data extraction for all the selected studies, we randomly selected 3 primary studies and performed a second extraction of the data. No inconsistencies were detected.
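The test-retest check described above can be sketched as follows, under stated assumptions: extraction forms are modelled as plain dictionaries keyed by study identifier, `retest_sample` and `re_extract` are hypothetical names, and the field names are placeholders rather than the actual columns of the extraction form.

```python
import random


def retest_sample(first_pass: dict, re_extract, sample_size: int = 3, seed=None) -> dict:
    """Re-extract a random sample of studies and return those whose data differ."""
    rng = random.Random(seed)
    sample = rng.sample(sorted(first_pass), sample_size)
    mismatches = {}
    for study_id in sample:
        second = re_extract(study_id)
        if second != first_pass[study_id]:
            mismatches[study_id] = (first_pass[study_id], second)
    return mismatches


# Placeholder extraction forms for 32 studies; field names are illustrative.
first_pass = {
    f"PS{i}": {"domains": [], "challenges": [], "trends": []} for i in range(1, 33)
}

# Here the second extraction simply copies the first, so no mismatch can occur.
inconsistencies = retest_sample(first_pass, lambda sid: dict(first_pass[sid]), seed=0)
print(inconsistencies)
```

An empty result corresponds to the outcome reported in the text, where no inconsistencies were detected in the three re-extracted studies.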

Table 1 Data extraction form

SLR: implementation

In this section, we describe step-by-step the implementation and execution of the different stages of our SLR. Figure 2 depicts the search stages followed and the resulting number of primary studies for each stage.

In stage 1, an automated search was performed by applying the search string to the digital libraries. References were managed with Zotero (www.zotero.org), a popular choice for SLRs. We began the search using the following search string:

(“big data” OR “real-time data analytics” OR “large datasets”) AND (“methodologies” OR “techniques” OR “domains” OR “architectures” OR “solutions”) AND (“research trends” OR “future directions” OR “emerging technologies” OR “challenges” OR “SLR” OR “Systematic Literature Review”)

As a result, we found a total of 4204 studies. The reason for this many results could be attributed mostly to the main topic of this SLR being “Big Data”, a hugely popular field, especially in the last few years.

In stage 2, we used Zotero’s duplicate identification tool and found a total of 25 duplicates. Additionally, 1 duplicate was found manually, bringing the total number of results to 4178 articles.

In stage 3, studies were excluded based on the title and the language. Fortunately, all the documents were in English, so we just needed to focus on the title, eliminating what had no use for our research. This cut down the total number to 553.

In stage 4, we eliminated the articles whose abstracts were of marginal or no interest to us. At the end of the process, 189 Primary Studies were left, 32 of which were SLRs.

To ensure the best quality possible for our SLR, we have collected generic information on all the 189 studies that passed the Primary Study check. This information is depicted in Figs. 3 and 4. We then proceeded with an in-depth full-text review for the 32 PSs, which are the main subject of our SLR.

Fig. 2 Stages of the applied search strategy
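The arithmetic of the screening stages can be checked with a short script. The counts are those reported in the text; the stage labels are paraphrased, and the retention shares are derived here for illustration rather than quoted from the paper.

```python
# Number of studies remaining after each stage, as reported in the text.
stages = [
    ("Stage 1: automated search", 4204),
    ("Stage 2: duplicate removal", 4178),
    ("Stage 3: title and language screening", 553),
    ("Stage 4: abstract screening", 189),
    ("SLRs retained for full-text review", 32),
]

# Studies removed at each step and the share of the previous stage that was kept.
removed = [prev - cur for (_, prev), (_, cur) in zip(stages, stages[1:])]
retained = [cur / prev for (_, prev), (_, cur) in zip(stages, stages[1:])]

for (name, count), gone, share in zip(stages[1:], removed, retained):
    print(f"{name}: {count} left, {gone} removed, {share:.1%} of previous stage kept")
```

The first difference, 26, matches the 25 duplicates found by Zotero plus the 1 found manually.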

Figure 3 depicts the distribution per year for all the 189 studies. Our SLR focuses on the evolution of Big Data in the last 15 years. In any case, no studies before 2012 were detected. The reason for this could be attributed to the fact that before then Big Data, as a research topic, was not as popular.

Fig. 3 Number of filtered primary studies and number of total citations

Figure 4 represents the total number of citations per year for our selected 32 Primary Studies. The graph clearly shows that the most recent studies have not been cited as much. Particularly, even though the studies released in the last two years compose about one third of our selected primary studies (11 out of 32), we can see that they have not been cited as much in comparison to the previous years. The lower citation rate may indicate that recently, researchers have focused more on understudied areas or more recent emerging trends, suggesting that the field of Big Data is currently undergoing an evolution. However, further analysis of the quality, methodology and context of these studies is necessary for more concrete conclusions.

Fig. 4 Number of total PSs per year

For further clarity, we compiled Table 2 to present the chosen articles, reporting the first author’s family name, the venue, the title of each PS, and a short introduction that highlights its main findings. Note that “J” indicates that the article was published in a journal.

Table 2 Primary studies selected

To better understand the influence of the selected Primary Studies over time, we created a bubble chart to show the most cited documents by aggregating the PSs with the same publication year (see Fig. 5). The size of each bubble is proportional to the number of citations.

Fig. 5 Bubble chart showing the number of primary studies and total citations per year of publication
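The aggregation behind such a bubble chart can be sketched as below: studies are grouped by publication year, and each year receives a study count and a citation total that would drive the bubble size. The helper name `aggregate_by_year` is hypothetical and the input pairs are made up for illustration; they are not the values plotted in Fig. 5.

```python
from collections import defaultdict


def aggregate_by_year(studies):
    """Map each publication year to (number of studies, total citations)."""
    totals = defaultdict(lambda: [0, 0])
    for year, citations in studies:
        totals[year][0] += 1
        totals[year][1] += citations
    return {year: tuple(values) for year, values in sorted(totals.items())}


# Made-up (year, citations) pairs for illustration only.
sample = [(2019, 40), (2019, 10), (2021, 5)]
bubbles = aggregate_by_year(sample)
print(bubbles)  # {2019: (2, 50), 2021: (1, 5)}
```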

SLR: results

The study of the PSs allowed us to pinpoint exactly which research question (RQ1-RQ3) is answered by each primary study. Table 3 summarizes our findings.

Table 3 Research questions addressed by the primary studies

As previously stated, it is important to assess the quality of each study. In subsect. "Study quality assessment", we developed a brief questionnaire that would help us determine the quality of a primary study. Table 4 shows the results of this quality check. It uses a simple “Yes,” “No,” or “na” (used when we don’t have enough information to answer) to fill out the Quality Assessment questionnaire.

Table 4 Study quality assessment

From now on, we will briefly summarize each study and its findings.

PS1—A comprehensive and systematic literature review on the Big Data management techniques in the internet of things [20]

In this article, the authors explored the Big Data management techniques applied to the internet of things. Big Data was initially applied for healthcare monitoring, smart cities, and industrial systems. Over time, with the evolution of IoT, it expanded to include broader topics: healthcare applications involved health state monitoring and predictive modeling, smart cities encompassed traffic management, energy efficiency and security, while industrial systems employed Big Data to improve scalability and security. The application landscape broadened emphasizing the importance of quality attributes such as performance, efficiency, reliability, and scalability in ensuring the success of Big Data Analytics systems in IoT across ever-evolving domains.

The challenges and open issues in Big Data Analytics within IoT span various dimensions, including centralized architectures, energy consumption in data collection, blockchain limitations, communication challenges, and diverse data features.

For future research, the exploration of AI for intelligent mobile data collection will take on a more relevant role, combining compressive sensing with AI for communication challenges and utilizing new optimization algorithms for data processing. To ensure security and privacy in IoT, Big Data Analytics could involve cryptography mechanisms, a data perception layer and a lightweight framework with AI. Addressing these challenges is essential for advancing Big Data Analytics in the evolving landscape of IoT applications.

PS2—A comprehensive review on Big Data for industries challenges and opportunities [21]

The article explores the transformative impact of Big Data Analytics in power systems, mineral industries, and manufacturing. In power systems, it revolutionizes fault detection, enables early warning systems and predicts future electricity demand, enhancing reliability and decision-making. For mineral industries, Big Data improves data storage, processing and analytics, optimizing exploration, extraction, and resource management. In manufacturing, it facilitates data-driven decision-making, comprehensive product quality assessment, and streamlined supply chain management for increased operational efficiency.

The study also highlights challenges in implementing Big Data Analytics, emphasizing the crucial need for precise data quality assessment models and secure frameworks. Machine learning and data analytics play a pivotal role in overcoming challenges, particularly in fault detection, load forecasting, and reservoir management. The call for open-source databases and integration with machine learning addresses the scarcity of datasets, reflecting challenges in maximizing Big Data’s potential.

Furthermore, the paper recommends future research trends, including advanced data quality assessment models, frameworks for high-dimensional data and solutions for secure communication. Emphasizing open-source databases and integrating machine learning promotes a collaborative and transparent approach. The call for interpretable models reflects a trend toward understanding and optimizing Big Data Analytics. Overall, these recommendations shape the future direction of Big Data applications in diverse industries.

PS3—A survey on IoT Big Data current status, 13 V’s challenges, and future directions [22]

The document delves into the landscape of Big Data Analytics, particularly exploring its integration with the Internet of Things. Application domains such as energy, healthcare, transportation, and smart cities emerge prominently. The discussion unfolds how these domains have evolved, signalling a shift towards IoT-driven intelligent applications.

Within this expansive terrain, the study identifies and elucidates 13 major challenges encapsulated by the “13 V’s”. These challenges span traditional aspects like volume, velocity, and variety, extending to less common concerns like vagueness and location-aware data processing. The document also offers innovative solutions, like edge-based processing and semantic representation, as strategies to manage these complex challenges.

Regarding the future, the document outlines emerging trends anticipated to define the Big Data landscape in the coming 5 to 10 years. These include a focus on energy-efficient data acquisition, the integration of machine learning and deep learning for advanced analytics, a strategic emphasis on edge and fog infrastructures, the evolving paradigm of multi-cloud data management, a shift towards data-oriented network addressing, and the increasing adoption of blockchain technology. These trends collectively indicate a trajectory towards more efficient, scalable, and secure practices in Big Data Analytics, particularly within the realm of IoT applications.

PS4—A systematic literature review on features of deep learning in Big Data analytics [23]

The document navigates the evolution of Big Data, emphasizing challenges and the rise of machine learning, particularly Deep Learning. Machine learning’s widespread use, observed in areas like healthcare and finance, underscores its crucial role. Even in complex data scenarios, its effectiveness is evident, as demonstrated by the U.S. Department of Homeland Security’s success in identifying threats.

Recognizing a gap in existing research, the document proposes a review focusing on Deep Learning in Big Data Analytics. The goal is to explore features like hierarchical layers and high-level abstraction. The study emphasizes Deep Learning’s strength in handling extensive datasets, its versatility, and its ability to prevent overfitting.

This exploration into Big Data’s journey underscores the central role of machine learning. The proposed review, specifically focusing on Deep Learning in Big Data Analytics, not only captures current advancements but also suggests there’s more to discover in the future where Big Data and machine learning intersect.

PS5—A systematic survey of data mining and Big Data analysis in internet of things [24]

The document navigates through diverse applications of Big Data Analytics, illustrating its transformative journey across sectors. Notably, it tracks the evolution within healthcare and finance, showcasing how Big Data has become integral to these domains over time.

Going further, the research dives into the various challenges of Big Data analysis. It identifies three main challenges: dealing with societal changes, understanding how businesses use IoT, and solving technical issues like security and connectivity. The study emphasizes the need to adapt to society’s changing needs, categorize IoT uses in business, and tackle technical problems for effective Big Data analysis.

Moreover, the research anticipates future trends, in particular the rising importance of Big Data frameworks in handling expansive IoT-generated data. The intersection of these frameworks with data mining in the IoT domain emerges as a pivotal focus, pointing toward exciting possibilities and potential paths for future research in the realm of Big Data.

PS6—Access methods for Big Data: current status and future directions [25]

The document explores diverse applications of Big Data Analytics in research, education, urban planning, transportation, environmental modeling, energy conservation, and homeland security, emphasizing its transformative potential.

It addresses challenges like heterogeneity, scale, timeliness, privacy, and the evolving processing paradigms due to data volume surpassing computational resources.

Future directions include the need for systems handling structured and unstructured data, embedded analytics for real-time processing, innovative paradigms, application frameworks, and advanced databases ensuring transactional semantics. The research underscores the importance of tools addressing ethical, security, and privacy concerns.

PS7—An industrial Big Data pipeline for data-driven analytics maintenance applications in large-scale smart manufacturing facilities [26]

This research introduces an innovative Big Data pipeline designed for industrial analytics in manufacturing.

The pipeline excels in integrating legacy and smart devices, ensuring cross-network communication, and adhering to open standards, marking a significant evolution in the field. The document showcases the pipeline’s ability to handle complexities, integrate older systems, ensure reliability, and scale efficiently in industrial data analytics.

The future plan involves implementing the pipeline to validate its architecture, particularly in predictive maintenance for Wind Turbines and Air Handling Units, contributing to the evolving landscape of Big Data Analytics.

PS8—Applications of Big Data in emerging management disciplines: a literature review using text mining [27]

This study explores diverse applications of Big Data Analytics across twelve emerging management domains, emphasizing their dynamic nature over time.

It addresses adoption challenges, focusing on data quality, resource management, and distinguishing between the ability and capability of organizations in using Big Data Analytics. The research underscores the thoughtful adoption of Big Data Analytics and the importance of measuring its business value comprehensively. It acknowledges the difficulty of translating insights into real-time actionable items.

Looking forward, the study proposes a framework connecting emerging management domains with conventional practices, suggesting future research areas in human resources, marketing, sales, strategy, and services. The research emphasizes the need for in-depth exploration to integrate emerging domains into established management practices, providing valuable insights for research and practical application.

PS9—Applying Big Data analytics in higher education: a systematic mapping study [28]

The document conducts a thorough exploration of Big Data Analytics (BDA) in Higher Education Institutions from 2010 to 2020. It uncovers diverse BDA applications in three domains: Educational Quality, Decision-Making Process, and Information Management.

Challenges in BDA adoption include handling large data volumes, addressing privacy concerns, and dealing with resource constraints. The study emphasizes the need for practical outcomes, automated tools, and validated frameworks.

Despite robust research interest, the field exhibits immaturity, with a prevalence of conference papers indicating an early development stage. The study calls for increased empirical research to fortify the evidence base and foster a more mature BDA integration in higher education.

PS10—Artificial intelligence approaches and mechanisms for Big Data analytics: a systematic study [29]

The SLR explores AI-driven Big Data Analytics, emphasizing machine learning, knowledge-based reasoning, decision-making algorithms, and search methods. Applications, notably in supervised learning, aim to enhance precision and efficiency but grapple with complexity and scalability issues.

Challenges encompass processing vast, heterogeneous data, ensuring system security, and addressing qualitative parameters. Fog computing emerges as a potential solution, yet security concerns remain under-explored.

Emerging trends spotlight Big Data Analytics for IoT through fog computing, the need for enhanced algorithms handling extensive data, and the necessity to address data quality issues in unstructured formats.

PS11—Bibliometric mining of research directions and trends for Big Data [30]

The research identifies key application domains, with particular focus on China, and emerging directions such as Machine Learning and Healthcare.

Navigating challenges, the study introduces a semi-automatic method, utilizing blacklists and thesauri to enhance precision in identifying research directions. This favors a balance between automation and expert input.

The study forecasts Big Data’s future using a growth rate criterion, emphasizing Machine Learning and Deep Learning. Moreover, the study suggests that its methodology can be applied not only to Big Data but also to other research areas, such as Machine Learning.

PS12—Big Data adoption: state of the art and research challenges [31]

The study explores the widespread adoption of Big Data Analytics across diverse sectors such as finance, education, healthcare, and more. It identifies a need for increased research in untapped areas like education and healthcare, suggesting potential transformative effects.

Challenges in current Big Data research include the need for refined theoretical models, adaptable data collection methods, and larger sample sizes to ensure accuracy. The study recommends a mixed-method approach to address these challenges effectively.

The study, although not explicitly stating upcoming trends, suggests a changing research focus in both developing and developed countries. It indicates a growing awareness of untapped opportunities, hinting at a future emphasis on specific situations and new factors in Big Data adoption.

PS13—Big Data analytics for data-driven industry: a review of data sources, tools, challenges, solutions, and research directions [32]

This research provides a comprehensive overview of Big Data Analytics. Exploring application domains, it traces Big Data’s historical integration across education, healthcare, finance, national security, and Industry 4.0 components like IoT and smart cities.

Delving into challenges, the research highlights skill shortages, dataset management, privacy, scalability, and intellectual property issues. Solutions range from software-defined data management to innovative truthfulness and privacy preservation methods.

Looking ahead, the study identifies some emerging trends: sourcing data from education and diverse IoT devices, refining pre-processing, advancing data management, enhancing privacy, and exploring deep learning methods. These trends forecast a dynamic future for Big Data Analytics, shaping the field in the coming years.

PS14—Big Data analytics in healthcare: a systematic literature review and road map for practical implementation [33]

The paper conducts a thorough examination of Big Data Analytics (BDA) applications in healthcare, introducing the novel Med-BDA architecture.

Notably, the work addresses challenges inherent in BDA (such as increased costs, difficulty in acquiring a relevant skill set, rapidly expanding technology stack, and heightened management overhead), presenting a comprehensive road map to alleviate issues such as cost escalation and skill acquisition hurdles.

The document concludes by outlining the potential for extensions to Med-BDA and its applicability to diverse Big Data domains, showcasing a forward-looking perspective in BDA research and application.

PS15—Big Data analytics in telecommunications: literature review and architecture recommendations [34]

The document explores Big Data Analytics in TELCO, introducing LambdaTel as a proposed solution for batch and streaming data processing. It discusses Big Data Analytics applications like CRM and Customer Attrition.

Challenges, such as the lack of standardized architecture, are acknowledged. LambdaTel addresses these challenges through a structured approach, emphasizing security and recommending the use of Python.

While not explicitly talking about future trends, the document suggests a commitment to ongoing adaptation, seen in recommendations like Python usage, Dockerized implementation and the application of LambdaTel in a local Telco company for cross-selling/up-selling.

PS16—Big Data analytics meets social media: a systematic review of techniques, open issues, and future directions [35]

The document highlights social media’s transformative impact in healthcare, emphasizing the use of social platforms for patient support, disease prevention, and real-time tracking of contagious diseases.

The review highlights challenges in both content and network-oriented approaches, such as privacy concerns, scalability limitations, and accuracy enhancement with incomplete data. Comprehensive resolution remains an open frontier, requiring innovative solutions for privacy preservation and accurate predictions.

The paper also highlights emerging trends in Big Data Analytics, emphasizing real-time and predictive analysis, and addressing challenges in sentiment analysis. It identifies underexplored areas like political and e-commerce applications, underscoring the expanding trajectory of Big Data Analytics. Furthermore, it emphasizes the evolving complexities of linguistic analysis, underlining the need for domain-dependent sentiment analysis and for addressing challenges like sarcasm detection.

PS17—Big Data and its future in computational biology: a literature review [36]

The document underscores the growing significance of Big Data in computational biology and healthcare, particularly in the conversion of healthcare records into digital formats. It highlights the major application domains, focusing on optimizing health and medical care through electronic health data.

Challenges include the under-utilization of electronic health data and the need to convert raw data into actionable information. Despite increasing interest, the field lacks comprehensive literature reviews.

The document outlines emerging trends in Big Data for computational biology and bioinformatics. It emphasizes the pivotal role of volume, variety, and velocity in defining Big Data’s impact on bioinformatics. Key technologies, including Hadoop and MapReduce, are discussed, illustrating their significance in the field. The integration of Big Data technology is shown to enhance biological findings and facilitate real-time identification of high-risk patients. However, limitations, such as narrow study focuses, are noted.

PS18—Big Data and sentiment analysis: a comprehensive and systematic literature review [37]

The document delves into the diverse applications of Big Data Analytics, spotlighting its evolution, notably in sentiment analysis for marketing and disaster response.

Challenges identified include data quality issues and the absence of standardized disaster-related datasets. The limitations of centralized data mining algorithms for distributed systems are acknowledged, urging exploration into other platforms (YARN is directly cited as an example). The analysis underscores the need for immediate and improved performance, emphasizing real-time analysis.

As future work, the study encourages researchers to examine specific methods such as Hadoop, MapReduce, and deep learning more closely, in order to better understand their strengths and weaknesses.

PS19—Big Data applications on the internet of things: a systematic literature review [38]

This document explores the evolving applications of Big Data, from understanding customer sentiments to enhancing disaster response. Hadoop emerges as a popular framework.

Challenges include robust data acquisition from IoT devices, addressing security concerns and optimizing system scalability.

Future directions involve improving algorithms for efficiency, addressing energy consumption, and exploring the synergy of Big Data and machine learning for emergency systems.

PS20—Big Data in education: a state of the art, limitations, and future research directions [39]

The paper discusses how Big Data Analytics is used in various areas, especially in education, noting a marked increase in publications from 2014 to 2019. It highlights key topics such as student behavior, model creation, the use of data for education, system improvement, and the integration of Big Data (as a topic) into study plans.

Researchers face challenges in employing qualitative methods and data collection techniques, highlighting the need for quantitative approaches and more robust methodologies.

Future research should emphasize quantifying Big Data’s impact, adopting efficient solutions, exploring new tools and developing frameworks for educational applications. Integrating the concept of Big Data into study plans requires significant restructuring and well-designed learning activities.

PS21—Big Data in healthcare—a comprehensive bibliometric analysis of current research trends [40]

This document unveils the dynamic evolution of Big Data Analytics across diverse application domains, with a notable surge in research activities within the healthcare sector since 2012.

While the study discusses various related studies and challenges in Big Data analysis, it does not directly address or provide specific solutions to those challenges.

Looking ahead, the document reveals emerging trends and directions shaping the future of Big Data Analytics over the next 5 to 10 years. Key themes include data analytics, predictive analytics, and collaborative networks, providing a glimpse into the evolving landscape of research endeavors.

PS22—Big Data life cycle in shop-floor-trends and challenges [41]

The document explores Big Data Analytics in manufacturing, emphasizing its application domains like maintenance, automation, and decision-making.

Challenges include data measurement errors, high-frequency sampling issues, and the need for real-time processing. The study notes a shift to scalable storage options and highlights the importance of efficient data management.

Emerging trends involve the prominent role of AI and statistical approaches in data processing, coupled with a growing emphasis on data privacy. The study concludes with a call for future work focused on developing a consolidated framework for the Big Data life cycle in manufacturing.

PS23—Big Data testing techniques: taxonomy, challenges and future trends [42]

The paper explores the shift from traditional to advanced testing methods to address challenges in ETL processes, data quality, and node failures.

Addressing major challenges in Big Data analysis, the paper emphasizes the inadequacy of traditional testing, highlighting specific difficulties like ETL testing, node failure prevention, and unit-level debugging. It showcases evolving strategies employed by researchers to ensure the quality of Big Data systems.

Looking ahead, the document outlines emerging research trends shaping the future of Big Data Analytics. It identifies trends such as combinatorial testing techniques, fault tolerance testing, and model-driven entity reconciliation testing as key areas for future exploration.

PS24—Big Data with cognitive computing: a review for the future [43]

The paper explores the application domains of Big Data Analytics, highlighting its early stage in conjunction with cognitive computing, particularly in healthcare.

Challenges in adoption are attributed to a perceived lack of strategic value. The study categorizes issues into data, process, and management challenges, emphasizing the potential of integrating cognitive computing to overcome barriers.

Regarding emerging trends, there’s a rising interest in cognitive computing. The research encourages more global collaboration and highlights a gap in understanding how Big Data studies impact decision-making processes.

PS25—Current approaches for executing Big Data science projects-a systematic literature review [44]

The paper explores the landscape of Big Data Analytics. Regarding the common application domains and their evolution, the study notes a significant increase in articles. Workshops play a crucial role in shaping the trajectory, reflecting a robust and expanding interest in Big Data Analytics, influenced by technological advancements.

It also addresses challenges in Big Data analysis, with a focus on workflows and agility. While acknowledging the conceptual nature of agility papers, a gap between theoretical benefits and practical implementation is underscored, necessitating further exploration to optimize agile frameworks for data science projects.

The study highlights emerging trends in Big Data, emphasizing the need for integrated frameworks in data science. It points out a research gap in standardized approaches, urging further exploration for innovative methodologies.

PS26—Data quality affecting Big Data analytics in smart factories: research themes, issues and methods [45]

This review explores the growing applications of Big Data Analytics in Smart Factories, emphasizing an upsurge in empirical case studies on production, process monitoring, and quality tracing.

Challenges involve key data quality issues (missing, anomalous, noisy, and old data), as well as ISO-defined data quality dimensions. While technical methods prevail, an integrated approach combining technical and non-technical methods for comprehensive data quality management is highlighted. Theoretical insights focus on data quality dimensions, issues, and resolutions, while practical implications underscore the need for collaboration and integrated methods.

The study calls for future research in frameworks, data quality requirements, and emerging scenarios, contributing to Big Data Analytics evolution in Smart Factories.
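Three of the data quality issues listed for PS26 (missing, anomalous, and old data) lend themselves to simple rule-based checks. The following is a minimal sketch under stated assumptions, not a method taken from [45]: the `Reading` type, the `quality_issues` function, and all thresholds are hypothetical, and noise detection is left out because it would require a signal model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    """Hypothetical shop-floor sensor reading."""
    sensor_id: str
    value: Optional[float]  # None models a missing measurement
    age_seconds: float      # time elapsed since the reading was taken


def quality_issues(reading, low, high, max_age_seconds):
    """Return the data quality issues detected for a single reading."""
    issues = []
    if reading.value is None:
        issues.append("missing")
    elif not (low <= reading.value <= high):
        issues.append("anomalous")  # outside the assumed plausible range
    if reading.age_seconds > max_age_seconds:
        issues.append("old")  # older than the assumed freshness limit
    return issues


# Hypothetical readings, checked against an assumed valid range of
# 0 to 100 and an assumed freshness limit of 60 seconds.
missing = quality_issues(Reading("s1", None, 5.0), 0.0, 100.0, 60.0)
stale_outlier = quality_issues(Reading("s2", 250.0, 120.0), 0.0, 100.0, 60.0)
clean = quality_issues(Reading("s3", 42.0, 1.0), 0.0, 100.0, 60.0)
```

Checks of this kind cover only the technical side; the integrated approach highlighted by the review would pair them with non-technical measures.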

PS27—Harnessing Big Data analytics for healthcare: a comprehensive review of frameworks, implications, applications, and impacts [46]

The study meticulously explores the landscape of Big Data Analytics in healthcare. Noteworthy application domains, such as multimodal data analysis and fusion, natural language processing, and electronic health records, emerge from this exploration.

Some challenges faced in Big Data analysis are presented in the document, highlighting issues like data quality, privacy concerns, and a shortage of skilled professionals. It emphasizes the necessity for interoperability and standardization while identifying ongoing challenges in multimodality, ethical considerations, and bias mitigation.

The research outlines emerging trends and directions in Big Data, emphasizing the importance of ongoing exploration in areas like multimodality, data mining, precision medicine, ethical considerations, and the broader understanding of the Big Data Ecosystem.

PS28—Leveraging Big Data in smart cities: a systematic review [47]

Big Data Analytics has evolved across diverse domains, expanding from finance and healthcare to smart cities and e-commerce. This evolution has been marked by a transformative impact on industries.

Challenges in Big Data, including security, privacy, and scalability issues, have prompted innovative solutions. Advanced encryption, anonymization techniques, and scalable computing frameworks address these concerns.

Looking ahead, emerging trends highlight the fusion of Big Data with AI, machine learning, and technologies like edge computing. Ethical considerations gain prominence and quantum computing’s potential is explored for handling massive datasets.

PS29—Roles and capabilities of enterprise architecture in Big Data analytics technology adoption and implementation [48]

The document explores the evolution and current state of Big Data Analytics, highlighting its diverse applications in domains like healthcare and finance.

Researchers have grappled with challenges such as data privacy and scalability, addressing them through innovations like advanced encryption and scalable algorithms.

Looking forward, emerging trends include the integration of Artificial Intelligence and Machine Learning for enhanced analytics and a growing focus on ethics and responsible data use. The intersection of Big Data with edge computing and IoT also opens new frontiers for real-time analytics.

PS30—Security and privacy challenges of Big Data adoption: a qualitative study in telecommunication industry [49]

The research investigates the evolution of Big Data Analytics applications across diverse domains, emphasizing healthcare, finance, marketing, and telecommunications.

Challenges include data security and privacy, addressed through advanced encryption and privacy-preserving techniques.

In the future, emerging trends highlight explainable AI, ethical data practices, and innovations in handling streaming data, graph databases, and blockchain integration.

PS31—The role of AI, machine learning, and Big Data in digital twinning: a systematic literature review, challenges, and opportunities [50]

The document explores diverse applications of Big Data Analytics across industries like healthcare, energy, and manufacturing. It underscores the evolution of these applications, highlighting a focus on optimization, diagnostics, and predictive analytics.

Challenges include data collection difficulties, the selection of AI models that are both accurate and fast, and the ongoing need for standardization in digital twinning.

The document anticipates future trends, emphasizing the integration of AI, Machine Learning, and Big Data, particularly in digital twinning. It sets the stage for ongoing research in optimizing industrial processes, predictive analytics, healthcare, and smart city implementations.

PS32—The state of the art and taxonomy of Big Data analytics: view from new Big Data framework [51]

The document extensively explores the landscape of Big Data Analytics, emphasizing the dominant role of Hadoop while acknowledging the rise of Apache Spark in recent years.

Major challenges in the field involve handling diverse data formats, optimizing algorithms for evolving hardware configurations, and bridging the gap between complex systems and end-users through user-friendly visualization techniques.

It anticipates future advancements in applications, specifically in domains like e-commerce and the IoT, while expressing optimism about increased investments in Big Data technology.

Discussion

In the last 15 years, Big Data has found applications across various domains, evolving over time in line with the evolution of technologies and new business needs. Some of the most common application domains for Big Data Analytics include:

  • Business and Finance, for example, to detect fraud by analyzing large datasets and identifying patterns indicative of fraudulent activities, or to study customer behavior, preferences, and trends to improve marketing strategies.

  • Healthcare, for example, to forecast disease outbreaks, patient admission rates, and treatment outcomes, or to personalize medicine with the analysis of genetic data for ad-hoc treatments.

  • Retail, for example, to automatically manage and optimize inventories and stock levels by predicting demand, or to build recommender systems for targeted and segmented customer profiles.

  • Manufacturing, for example, to predict and schedule maintenance needs and potential equipment failures by analyzing sensor data, or to improve product quality by monitoring and analyzing production processes.

  • Telecommunications, for example, to optimize network performance in real time and identify areas for improvement, or to predict customer churn by identifying the factors and customer behaviors that contribute to it.

  • Government, Public Services, and Transportation, for example, to plan efficient urban mobility, traffic management, and resource allocation in Smart Cities, or to predict and prevent criminal activities, or to optimize energy distribution and reduce wastage, or to optimize transportation routes and vehicle fleets, reducing delivery times and improving efficiency and cost savings.

  • Media, Entertainment, and Education, for example, to recommend movies, music, or articles based on users’ behaviors and preferences, or to tailor content and advertising by studying users’ behaviors, or to improve educational impact by analyzing student performance.

In Fig. 6, we show the distribution of the studies addressing the three research questions (RQ1-RQ3) from which we started our investigation: 31 PSs discuss common application domains where the use of Big Data solutions is relevant (RQ1); 30 PSs analyze research challenges and limitations of Big Data (RQ2); 28 PSs highlight emerging research trends and directions in Big Data (RQ3). The total number of papers addressing the three RQs differs from the number of selected PSs (32) because of overlaps and intersections (i.e., a PS can address multiple RQs).

Fig. 6

Distribution of studies addressing the three research questions
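The effect of these overlaps can be checked with a short calculation. The sketch below uses only the counts reported above (31, 30, and 28 out of 32 PSs); the toy assignment of studies A, B, and C to research questions is hypothetical and serves only to show why per-RQ counts add up to more than the number of studies.

```python
# Counts reported in the text.
TOTAL_PS = 32
rq_counts = {"RQ1": 31, "RQ2": 30, "RQ3": 28}

# Each PS is counted once for every RQ it addresses.
total_assignments = sum(rq_counts.values())
average_rqs_per_ps = total_assignments / TOTAL_PS


def coverage(assignments):
    """Count how many studies address each RQ; a study may count several times."""
    counts = {}
    for rqs in assignments.values():
        for rq in rqs:
            counts[rq] = counts.get(rq, 0) + 1
    return counts


# Hypothetical toy assignment, not taken from the review's data.
toy = {
    "A": {"RQ1", "RQ2", "RQ3"},
    "B": {"RQ1", "RQ2"},
    "C": {"RQ1"},
}
toy_counts = coverage(toy)
```

With the reported figures, the per-RQ counts sum to 89, so each PS addresses on average about 2.78 of the three research questions.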

To better understand the main focus of the PSs, Fig. 7 shows the distribution of studies addressing the three research questions, this time without intersections (i.e., each primary study is assigned to exactly one of the three categories). We classify 12 PSs as mainly focused on RQ1, 10 PSs on RQ2, and 10 PSs on RQ3. This homogeneous distribution allows us to be optimistic about the results of our research, since we had a good number of studies to answer each of our research questions.

Fig. 7

Distribution of studies mainly addressing the three research questions

To further clarify the main focus of the studies, we categorized each one. Figures 8, 9, and 10 show the focus of the documents for each research question (note that the sum of the categorized documents may be greater than the number of studies that answer that RQ, because a study may belong to more than one category).

Fig. 8

Categorization of RQ1 studies

Fig. 9

Categorization of RQ2 studies

Fig. 10

Categorization of RQ3 studies

Having clarified this, we now discuss the findings of our SLR. We divided this discussion into three sections, one for each research question, so that we could clearly define which elements answer which question.

RQ1: what are the most common application domains for Big Data analytics, and how have they evolved over time?

Delving into the realm of Big Data across various sectors over the last 15 years reveals a narrative of evolution and adaptation. Initially rooted in finance, healthcare and marketing, the domain of Big Data analytics has undergone a metamorphosis, embracing applications from computational biology to education and manufacturing, expanding into the avant-garde concept of digital twinning. This dynamic evolution is evident in studies investigating Big Data management techniques on the Internet of Things, where the focus has shifted from basic health state monitoring to sophisticated predictive modeling. This evolution signifies a maturation of Big Data analytics, with an increased focus on nuanced attributes like performance, efficiency, reliability, and scalability.

RQ2: what are the major challenges and limitations that researchers have encountered in Big Data analysis, and how have they been addressed?

Shifting our focus to the challenges within the Big Data analytics landscape, a complex history of persistent hurdles and inventive solutions comes into focus. The studies converge on a common thread, unraveling ongoing challenges encapsulated in the trio of data quality, scalability, and privacy/security concerns. Researchers faced with these challenges have become architects of innovative solutions, leveraging advanced algorithms, distributed frameworks, and privacy-preserving techniques. These solutions reflect a commitment to advancing the field in response to the complexities of handling vast and dynamic datasets.

In the implementation of Big Data Analytics, diverse challenges emerge. A dedicated study on industries points to crucial issues in data quality assessment models and secure frameworks. Here, the role of machine learning and data analytics, particularly in fault detection and reservoir management, becomes pivotal. The interconnected nature of these challenges emphasizes the importance of a comprehensive approach to implementation. Beyond technological challenges, ethical considerations surrounding data privacy and security take center stage. Researchers stress the significance of tools addressing ethical concerns, underlining that responsible deployment is intrinsic to the ethical use of Big Data Analytics.

In response to these challenges, the industry advocates for innovative solutions, emphasizing AI-driven approaches, cryptography mechanisms, and lightweight frameworks with AI. This recognition underscores the need for inventive strategies to navigate the intricate integration of Big Data into rapidly evolving technological landscapes.

RQ3: what are the emerging research trends and directions in Big Data that will likely shape the field in the next 5 to 10 years?

Looking into the next 5 to 10 years, several trends are expected to shape the landscape of Big Data Analytics. One significant trend involves making data acquisition more energy-efficient, a move that aligns with broader sustainability goals. The integration of machine learning and deep learning techniques is anticipated to enhance the analytical capabilities of Big Data systems, enabling more accurate predictions and insights. Another noteworthy trend is the emphasis on edge and fog infrastructures, signifying a shift towards decentralized processing for faster data processing and decision-making, especially relevant in the context of the Internet of Things. Importantly, these trends extend beyond technological advancements to include ethical considerations. As Big Data assumes a pivotal role in decision-making processes, these ethical dimensions must be at the forefront. This means addressing the difficult ethical questions raised by the considerable influence that data analytics exerts.

In essence, the trajectory of Big Data analytics in the coming years is a dual journey, one that advances technologically with a keen eye on efficiency and, concurrently, prioritizes ethical practices. It’s a future where innovation and responsibility go hand in hand, defining a landscape that reflects both progress and ethical consciousness.

Threats to validity

Ensuring the validity of an SLR is essential for the development of a reliable study. For this reason, in this section, we examine potential threats to construct, internal, and external validity, aiming to maintain the robustness of our findings.

Construct validity determines whether the implementation of the SLR aligns with its initial objectives. The efficacy of our search process and the relevance of search terms are crucial concerns. While our search terms were derived from well-defined research questions and adjusted accordingly, their completeness and comprehensiveness may be subject to limitations. Additionally, different keywords might have returned other relevant studies that have not been taken into consideration. A potential language bias may also exist due to the exclusion of non-English articles, representing a limitation that should be acknowledged in the overall validity of the research.

Internal validity assesses the extent to which the design and execution of the study minimize systematic errors. A key focus is on the process of data extraction from the selected primary studies. Some required data may not have been explicitly expressed or may have been entirely missing, posing a potential threat to internal validity. To reduce this risk, the SLR process was supervised by a second person.

External validity examines the extent to which the observed effects of the study can be applied beyond its scope. In this SLR, we concentrated on research questions and quality assessments to mitigate the risk of limited generalizability. However, the study’s focus on the specific domain of Big Data research may limit external validity. Moreover, the dynamic nature of Big Data and the predefined time frame (last 15 years) could affect the generalizability of findings. Recognizing these constraints, the outcomes of this SLR are considered generalizable within the specified context of Big Data research.

By acknowledging these potential threats to validity, we strive to enhance the credibility and reliability of our SLR, contributing valuable insights to the evolving landscape of Big Data research.

Conclusion

Over the past 15 years, Big Data has become a crucial player in various fields, adapting to technological shifts and meeting the changing needs of businesses. This review has taken a closer look at how Big Data has been applied, its challenges, and what we can expect in the near future. 189 studies were ultimately found, 32 of which were SLRs analyzed for this study.

Big Data started in areas like Business, Healthcare, and Marketing, but its influence has ultimately grown. Now, it helps predict disease outbreaks, manage retail inventory, forecast equipment failures in manufacturing, improve network performance, optimize urban planning, personalize media content, and enhance education.

Dealing with Big Data hasn’t been without challenges. Issues like ensuring data quality, handling scalability, and maintaining privacy and security have been persistent. Researchers have responded with creative solutions, using advanced algorithms and privacy measures.

Looking to the future, the trends suggest exciting developments. Making data acquisition more energy-efficient and integrating advanced machine learning techniques are on the horizon. There is a shift toward decentralized processing, especially with the Internet of Things in mind. Importantly, these trends aren’t just about technology; they also emphasize ethical considerations. Ethical issues need careful attention as Big Data becomes more influential in decision-making processes.

To summarize, the future of Big Data is a journey that combines technological progress with a strong ethical stance. It’s a path where innovation and responsibility walk hand in hand, shaping a landscape that advances both technologically and ethically. The last 15 years have set the stage and the road ahead invites us to keep exploring and engaging with the ever-evolving world of Big Data.