Introduction

Advances in information technology have facilitated the generation of large volumes of high-velocity data, together with the ability to store data continuously, leading to several computational challenges. Given the nature of the big data being generated recently in terms of volume, velocity, variety, variability, veracity, volatility, and value [1], big data computing is a new trend for future computing.

Big data computing can be broadly categorized into two types based on processing requirements: big data batch computing and big data stream computing [2]. Big data batch processing is not sufficient when it comes to analysing real-time application scenarios. Most of the data generated in a real-time data stream need real-time analysis, and the output must be generated with low latency, with any incoming data reflected in the newly generated output within seconds. This necessitates big data stream analysis [3].

The demand for stream processing is increasing: not only must huge volumes of data be processed, but the data must be processed speedily so that organisations or businesses can react to changing conditions in real time.

This paper presents a systematic review of big data stream analysis. The purpose is to present an overview of research works, findings, as well as implications for research and practice. This is necessary to (1) provide an update about the state of research, (2) identify areas that are well researched, (3) showcase areas that are lacking and need further research, and (4) build a common understanding of the challenges that exist for the benefit of the scientific community.

The rest of the paper is organized as follows: the “Background and related work” section provides information on stream computing and big data stream analysis and the key issues involved, and reviews related work on big data streaming analytics. The “Research method” section discusses the adopted research methodology, while the “Result” section presents the findings of the study. The “Discussion” section presents a detailed evaluation of big data stream analysis, the “Limitation of the review” section highlights the limitations of the study, and the “Conclusion and further work” section concludes the paper.

Background and related work

Stream computing

Stream computing refers to the processing of massive amounts of data generated at high velocity from multiple sources, with low latency and in real time. It is a new paradigm necessitated by new data-generating scenarios, which include the ubiquity of location services, mobile devices, and pervasive sensors [4]. It can be applied to high-velocity flows of data from real-time sources such as the Internet of Things, sensors, market data, mobile devices, and clickstreams.

The fundamental assumption of this paradigm is that the potential value of data lies in its freshness. As a result, data are analysed as soon as they arrive in a stream to produce results, as opposed to batch computing, where data are first stored before they are analysed. There is a crucial need for parallel architectures and scalable computing platforms [5]. With stream computing, organisations can analyse and respond in real time to rapidly changing data. Stream processing frameworks include Storm, S4, Kafka, and Spark [6,7,8]. The key contrasts between the batch processing and the stream processing paradigms are outlined in Table 1.

Table 1 Comparison between batch processing and streaming processing [82]

Incorporating streaming data into the decision-making process necessitates a programming paradigm called stream computing. With stream computing, fairly static questions can be evaluated continuously on data in motion (i.e. real-time data) [9].

Big data stream analysis

The essence of big data streaming analytics is the need to analyse and respond to real-time streaming data using continuous queries, so that analysis can be performed continuously, on the fly, within the stream. Stream processing solutions must be able to handle high volumes of real-time data from diverse sources while taking availability, scalability and fault tolerance into consideration. Big data stream analysis involves the ingestion of data as an infinite sequence of tuples, their analysis, and the production of actionable results, usually in the form of a stream [10].

In a stream processor, applications are represented as a data flow graph made up of operators and interconnected streams, as depicted in Fig. 1. In a streaming analytics system, an application comes in the form of continuous queries; data are ingested continuously, analysed and correlated, and a stream of results is generated. A streaming analytics application is usually a set of operators connected by streams. Streaming analytics systems must be able to identify new information, incrementally build models and assess whether new incoming data deviate from model predictions [9].

Fig. 1

Data flow graph of a stream processor. The figure shows how applications (made up of operators and interconnected streams) are represented as a data flow graph in a stream processor [10]
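To make the data flow idea concrete, the following is a minimal sketch in Python of an application expressed as a chain of operators connected by streams; the operator names (source, parse, keep_errors, count_by_level, sink) are illustrative assumptions, not the API of any of the frameworks mentioned above.

    # Minimal sketch of a data flow graph: operators connected by streams.
    # All operator names are illustrative only.
    from collections import Counter

    def source(lines):                 # source operator: emits raw tuples
        for line in lines:
            yield line

    def parse(stream):                 # transform operator: raw line -> (level, message) tuple
        for line in stream:
            level, _, msg = line.partition(" ")
            yield (level, msg)

    def keep_errors(stream):           # filter operator: passes only error tuples
        for level, msg in stream:
            if level == "ERROR":
                yield (level, msg)

    def count_by_level(stream):        # stateful aggregate operator: emits running counts
        counts = Counter()
        for level, _ in stream:
            counts[level] += 1
            yield dict(counts)

    def sink(stream):                  # sink operator: acts on each result as it is produced
        for result in stream:
            print(result)

    # Wiring the graph: source -> parse -> filter -> aggregate -> sink
    feed = ["INFO ok", "ERROR disk full", "ERROR timeout"]
    sink(count_by_level(keep_errors(parse(source(feed)))))

Because every operator is a generator, each tuple flows through the whole graph as soon as it is produced, mirroring the continuous-query behaviour described above.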

The idea of streaming analytics is that each received data tuple is processed in a data processing node. Such processing includes removing duplicates, filling in missing data, data normalization, parsing, and feature extraction, which are typically done in a single pass because of the high data rates of external feeds. When a new tuple arrives, this node is triggered, and it expels tuples older than the time specified in the sliding window (a sliding window is a typical example of the windows used in stream computing; it keeps only the latest tuples up to the time specified in the window). A window is a logical container for received data tuples. It defines how frequently the data in the container are refreshed as well as when data processing is triggered [4].
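As an illustration of the windowing behaviour just described, the following sketch (a simplified Python example under assumed names, not drawn from any particular engine) keeps only the tuples that fall within a fixed time window, evicting older tuples on every arrival and then triggering processing over the retained tuples.

    # Illustrative time-based sliding window: keeps only tuples received within
    # the last `window_seconds` and triggers processing on every new arrival.
    from collections import deque

    class SlidingWindow:
        def __init__(self, window_seconds, on_trigger):
            self.window_seconds = window_seconds
            self.on_trigger = on_trigger      # processing callback
            self.buffer = deque()             # (timestamp, tuple) pairs, oldest first

        def insert(self, timestamp, item):
            # Evict tuples older than the window, relative to the newest arrival.
            cutoff = timestamp - self.window_seconds
            while self.buffer and self.buffer[0][0] < cutoff:
                self.buffer.popleft()
            self.buffer.append((timestamp, item))
            # Trigger the analysis over the current window contents.
            self.on_trigger([t for _, t in self.buffer])

    # Example: report how many readings were seen in the last 10 seconds.
    window = SlidingWindow(10, on_trigger=lambda items: print(len(items), "tuples in window"))
    for ts, reading in [(1, 0.5), (4, 0.7), (12, 0.6), (15, 0.9)]:
        window.insert(ts, reading)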

Key issues in big data stream analysis

Big data stream analysis is relevant when there is a need to obtain useful knowledge from current happenings in an efficient and speedy manner, in order to enable organisations to react quickly to problems or detect new trends that can help improve their performance. However, there are challenges such as scalability, integration, fault-tolerance, timeliness, consistency, heterogeneity and incompleteness, load balancing, high throughput, privacy issues, and accuracy [3, 11,12,13,14,15,16,17,18] which arise from the nature of big data streams and must be dealt with.

Scalability

One of the main challenges in big data streaming analysis is scalability. Big data streams are growing exponentially, at a rate much faster than computing resources: processors follow Moore’s law, but the size of data is exploding. Therefore, research efforts should be geared towards developing scalable frameworks and algorithms that accommodate the data stream computing mode, effective resource allocation strategies and parallelization, to cope with the ever-growing size and complexity of data.

Integration

Building a distributed system in which each node has a partial view of the data flow, that is, every node performs analysis over a small number of sources and these views are then aggregated to build a global view, is non-trivial. An integration technique should be designed to enable efficient operations across different datasets.

Fault-tolerance

High fault-tolerance is required in life-critical systems. As data are real-time and infinite in big data stream computing environments, a scalable, highly fault-tolerant strategy is required that allows an application to continue working, without interruption, despite component failures.

Timeliness

Time is of the essence for time-sensitive processes such as mitigating security threats, thwarting fraud, or responding to a natural disaster. There is a need for scalable architectures or platforms that enable continuous processing of data streams and can be used to maximize the timeliness of data. The main challenge is implementing a distributed architecture that aggregates local views of the data into a global view with minimal latency between communicating nodes.

Consistency

Achieving high consistency (i.e. stability) in big data stream computing environments is non-trivial as it is difficult to determine which data are needed and which nodes should be consistent. Hence a good system structure is required.

Heterogeneity and incompleteness

Big data streams are heterogeneous in structure, organisation, semantics, accessibility and granularity. The challenge here is how to handle ever-increasing data, extract meaningful content from it, and aggregate and correlate streaming data from multiple sources in real time. A competent data representation should be designed to reflect the structure, diversity and hierarchy of the streaming data.

Load balancing

A big data stream computing system is expected to be self-adaptive to changes in data streams and to avoid load shedding. This is challenging, as dedicating resources to cover peak loads 24/7 is impossible and load shedding is not feasible when the variance between the average load and the peak load is high. As a result, a distributed environment that automatically streams partial data streams to a global centre when local resources become insufficient is required.

High throughput

Deciding which sub-graph needs replication, how many replicas are needed, and which portion of the data stream to assign to each replica is an issue in big data stream computing environments. Good replication across multiple instances is needed if high throughput is to be achieved.

Privacy

Big data stream analytics has created opportunities for analysing huge amounts of data in real time, but it has also created a big threat to individual privacy. According to the International Data Corporation (IDC), less than half of the information that needs protection is effectively protected. The main challenge is proposing techniques for protecting a big data stream dataset before its analysis.

Accuracy

One of the main objectives of big data stream analysis is to develop effective techniques that can accurately predict future observations. However, the inherent characteristics of big data, such as volume, velocity, variety, variability, veracity, volatility, and value, strongly constrain processing algorithms in space and time; hence stream-specific requirements must be taken into consideration to ensure high accuracy.

Related work

This section discusses some of the previous research efforts that relate to big data streaming analytics.

The work of [13] presented a review of various tools, technologies and methods for big data analytics by categorizing the big data analytics literature according to its research focus. This paper is different in that it presents a systematic literature review focused on big data “streaming” analytics.

Authors in [19] presented a systematic review of big data analytics in e-commerce. The study explored characteristics, definitions, business values, types and challenges of big data analytics in the e-commerce landscape. Likewise, [20] conducted a study centred on big data analytics in technology and organisational resource management, specifically focusing on reviews that present big data challenges and big data analytics methods. Although these are systematic reviews, their focus is not particularly on big data streaming.

Authors in [21] presented the status of empirical research and application areas in big data by employing a systematic mapping method. In the same vein, authors in [22] conducted a survey on big data technologies and machine learning algorithms with a particular focus on anomaly detection. A systematic review of literature which aims to determine the scope, application, and challenges of big data analytics in healthcare was presented by [23]. The work of [2] presented a review of four big data streaming tools and technologies, whereas the study conducted in this paper provides a comprehensive review not only of big data streaming tools and technologies but also of the methods and techniques employed in analysing big data streams. In addition, the authors of [2] did not provide a clear explanation of the methodical approach used for selecting the reviewed papers.

Research method

The study was grounded in a systematic literature review of the tools, technologies, methods and techniques used in analysing big data streams, adopting [24, 25] as models.

Research question

The study tries to answer the following research questions:

  • Research Question 1: What are the tools and technologies employed for big data stream analysis?

  • Research Question 2: What methods and techniques are used in analysing big data streams?

  • Research Question 3: What do these tools and technologies have in common, and how do they differ in terms of concept, purpose and capabilities?

  • Research Question 4: What are the limitations and strengths of these tools and technologies?

  • Research Question 5: What are the evaluation techniques or benchmarks used for evaluating big data streaming tools and technologies?

Search string

Creating a good search string requires structuring in terms of population, comparison, intervention and outcome [24]. Relevant publications were identified by forming a search string that combined keywords driven by the research questions earlier stated. The searches were conducted by employing three standard database indexes, which are Scopus, Science Direct and EBSCOhost. The search string is “big data stream analysis” OR “big data stream technologies” OR “big data stream framework” OR “big data stream algorithms” OR “big data stream analysis tools” OR “big data stream processing” OR “big data stream analysis reviews” OR “big data stream literature review” OR “big data stream analytics”.

Data sources

As research becomes increasingly interdisciplinary, global and collaborative, it is expedient to select from rich and standard databases. The databases consulted are as follows:

  i. Scopus: Scopus is a bibliographic database containing abstracts and citations for academic journal articles, launched in 2004. It covers 36,377 titles from over 11,678 publishers, of which 34,346 are peer-reviewed journals, delivering a comprehensive overview of the world’s research output in the scientific, technical, medical, and social sciences (including arts and humanities). It is the largest abstract and citation database of peer-reviewed literature.

  ii. ScienceDirect: ScienceDirect is Elsevier’s leading information solution for researchers, students, teachers, information professionals and healthcare professionals. It provides both subscription-based and open-access content in a large database combining authoritative, full-text scientific, technical and health publications with smart, intuitive functionality. It covers over 14 million publications from over 3800 journals and more than 35,000 books. The journals are grouped into four categories: Life Sciences, Physical Sciences and Engineering, Health Sciences, and Social Sciences and Humanities.

  iii. EBSCOhost: EBSCOhost covers a wide range of bibliographic and full-text databases for researchers, providing an electronic journal service available to both corporate and academic researchers. It indexes and abstracts a total of 16,711 journals and magazines, of which 14,914 are peer-reviewed, along with more than 900,000 high-quality e-books and over 60,000 audiobooks from more than 1500 major academic publishers.

  iv. ResearchGate: A free online professional network for scientists and researchers to ask and answer questions, share papers and find collaborators. It covers over 100 million publications from over 11 million researchers. ResearchGate was used as a secondary source where the authors could not access some papers due to lack of subscription.

Data retrieval

The search was conducted in Scopus, ScienceDirect and EBSCOhost, since most of the high-impact journals and conferences are indexed in this set of rich databases. The Boolean ‘OR’ operator was used to combine the nine (9) search strings. A total of 2295 articles were retrieved from the three databases, as shown in Table 2.

Table 2 First search string result

Further refinement was performed by (i) limiting the search to journal and conference papers; (ii) selecting computer science and IT-related fields as the subject domain; (iii) selecting ACM, IEEE, SpringerLink and Elsevier as sources; and (iv) restricting the year of publication to between 2004 and 2018. This year range was selected because interest in big data stream analysis started in 2004. At this stage, a total of 1989 papers were excluded, leaving 315 papers (see Table 3). The result of the search string was exported to PDF.

Table 3 Second search string result

By going through the titles of the papers, 111 seemingly relevant papers were extracted, excluding a total of 213 papers that were not relevant at this stage (see Table 4).

Table 4 Third Search string refinement result

The abstracts of the 111 papers, and the introductions of papers whose abstracts were not clear enough, were then read to obtain a quick overview of each paper and to ascertain whether it was suitable or at variance with the research questions. The citations of the papers were exported to Microsoft Excel for easy analysis. The papers were grouped into three categories: “relevant”, “may be relevant” and “irrelevant”. The “relevant” papers were marked in black, and the “may be relevant” and “irrelevant” papers in green and red respectively. At the end of this stage, 45 papers were classified as “relevant”, 9 papers as “may be relevant” and 11 as “irrelevant”. Looking critically at the abstracts again, 18 papers were excluded using the exclusion criteria, leaving a total of 47 papers (see Table 5), which were manually reviewed in line with the research questions.

Table 5 Final Selection

Inclusion criteria

Papers published in journals, peer-reviewed conferences, workshops, technical reports and symposia from 2004 to 2018 were included. In addition, where papers reported similar investigations and results, the most recent paper was selected.

Exclusion criteria

Papers in the following categories were excluded from the primary study: (i) papers written in a language other than English; (ii) papers whose abstract and/or introduction does not clearly define the contributions of the work; (iii) papers whose abstracts do not relate to big data stream analysis.

Result

The findings of the study are now presented with respect to the research questions that guided the execution of the systematic literature review.

Research Question 1: What are the tools and technologies employed for big data stream analysis?

Big data stream platforms provide functionalities and features that enable big data stream applications to be developed, operated, deployed, and managed. Such platforms must be able to pull in streams of data, process the data and stream it back as a single flow. Several tools and technologies have been employed to analyse big data streams. In response to the growing demand for big data streaming analytics, a large number of alternative big data streaming solutions have been developed, both by the open source community and by enterprise technology vendors. According to [26], there are several factors to consider when selecting big data streaming tools and technologies in order to make effective data management decisions. These are briefly described below.

Shape of the data

Streaming data sources require serialization technologies for capturing, storing and representing such high-velocity data. For instance, some tools and technologies allow different structures to be projected across data stores, giving flexibility to store and access data in different ways. However, the performance of such platforms may not be suitable for high-velocity data.

Data access

There is a need to put into consideration how the data will be accessed by users and applications. For instance, many NoSQL databases require specific application interfaces for data access. Hence there is a need to consider the integration of some other necessary tools for data access.

Availability and consistency requirement

If a distributed system is needed, then the CAP theorem states that consistency and availability cannot both be guaranteed in the presence of a network partition (i.e. when there is a break in the network). In such a scenario, consistency is often traded off for availability to ensure that requests can always be processed.
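As a toy illustration of this trade-off (a sketch under assumed names, not tied to any specific database), the read path below prefers a quorum read but falls back to a possibly stale local replica when a partition makes the quorum unreachable, i.e. it favours availability over consistency.

    # Toy illustration of trading consistency for availability under a network partition.
    class Replica:
        def __init__(self, value, version):
            self.value, self.version = value, version
            self.reachable = True

    def read(replicas, local):
        reachable = [r for r in replicas if r.reachable]
        if len(reachable) > len(replicas) // 2:
            # Quorum reachable: return the newest value (consistent read).
            newest = max(reachable, key=lambda r: r.version)
            return newest.value, "consistent"
        # Partitioned: answer from the local replica, which may be stale (available read).
        return local.value, "possibly stale"

    replicas = [Replica("v2", 2), Replica("v2", 2), Replica("v1", 1)]
    replicas[0].reachable = replicas[1].reachable = False   # simulate a network partition
    print(read(replicas, local=replicas[2]))                 # ('v1', 'possibly stale')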

Workload profile required

A Platform-as-a-Service deployment may be appropriate for a spiky load profile. If the platform can be deployed on an Infrastructure-as-a-Service cloud, this option may be preferred, as users pay only while processing. On-premise deployment may be considered for predictable or consistent loads. If workloads are mixed (i.e. consistent flows plus spikes), a combination of cloud and on-premise approaches may be considered, giving room for easy integration of web-based services or software and access to critical functions on the go.

Latency requirement

If minimal delay or low latency is required, key-value stores may be considered or, better still, an in-memory solution that allows large datasets to be processed in real time, in order to optimize the data loading procedure.

The tools and technologies for big data stream analysis can be broadly categorized into two, which are open source and proprietary solutions. These are listed in Tables 6 and 7.

Table 6 Open source tools and technologies for big data stream analysis
Table 7 Proprietary tools and technologies for big data stream analysis

The selection of big data streaming tools and technologies should be based on the importance of each of the factors mentioned earlier in this section. Proprietary solutions may not be easily available because of pricing and licensing issues. While open source supports innovation and development at a large scale, careful selection must be made, especially when choosing a recent technology still in production, due to limited maturity and lack of support from academic researchers or developer communities. In addition, open source solutions may lead to outdating and modification challenges [27]. Moreover, the choice between proprietary solutions, open source solutions, or a combination of both should depend on the problem to be addressed and an understanding of the true costs and benefits of both open and proprietary solutions.

Research Question 2: What methods and techniques are used in analysing big data streams?

Given the real-time nature, velocity and volume of social media streams, the clustering algorithms applied to streaming data must be highly scalable and efficient. Also, the dynamic nature of the data makes it difficult to know the required or desirable number of clusters in advance. This renders partitioning clustering techniques (such as k-median, k-means and k-medoid) and expectation-maximization (EM)-based approaches unsuitable for analysing real-time social media data, because they require prior knowledge of the number of clusters. In addition, due to the concept drift inherent in social media streams, scalable graph partitioning algorithms are also unsuitable because of their tendency towards balanced partitioning. Social media streams must be analysed dynamically in order to provide decisions at any given time within a limited space and time window [28,29,30].

Density-based clustering algorithms (such as DenStream, OpticStream, FlockStream, and Exclusive and Complete Clustering), unlike partitioning algorithms, do not require the number of clusters a priori and can detect outliers [31]. However, the issue with density-based clustering algorithms is that most of them, except for a few such as HDDStream, PreDeConStream and PKS-Stream (which are memory intensive), perform less efficiently on high-dimensional data and as a result are not suitable for analysing social media streams [32].

Threshold-based techniques, hierarchical clustering, and incremental (online) clustering are more relevant to social media analysis. Several online threshold-based or incremental stream clustering approaches, such as Markov Random Field [33, 34], Online Spherical K-means [35], and Condensed Clusters [36], have been adopted. Incremental approaches are suitable for grouping continuously generated data by setting a maximum similarity threshold between the incoming stream and the existing clusters. Much work has been done on improving the efficiency of online clustering algorithms; however, little research effort has been directed at threshold and fragmentation issues. Threshold setting in incremental algorithms should employ an adaptive approach instead of relying on static values [37, 38]. Some of the methods and techniques that have been employed in analysing big data streams are outlined in Table 8.

Table 8 Methods and techniques for big data stream analysis
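To make the incremental threshold-based idea concrete, the following is a minimal sketch under simplifying assumptions (bag-of-words items, cosine similarity over term counts, and a static threshold, whereas the discussion above argues for an adaptive one): each incoming item joins its most similar existing cluster if the similarity exceeds the threshold, otherwise it seeds a new cluster.

    # Sketch of incremental (online) threshold-based stream clustering.
    # Assumptions: items are short texts, clusters are centroid term-count vectors,
    # and the similarity threshold is static (an adaptive threshold is preferable).
    import math
    from collections import Counter

    def cosine(a, b):
        num = sum(a[t] * b[t] for t in set(a) & set(b))
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    class IncrementalClusterer:
        def __init__(self, threshold=0.3):
            self.threshold = threshold
            self.clusters = []                    # each cluster is a centroid term-count vector

        def add(self, text):
            vec = Counter(text.lower().split())
            best, best_sim = None, 0.0
            for idx, centroid in enumerate(self.clusters):
                sim = cosine(vec, centroid)
                if sim > best_sim:
                    best, best_sim = idx, sim
            if best is not None and best_sim >= self.threshold:
                self.clusters[best].update(vec)   # merge item into its closest cluster
                return best
            self.clusters.append(vec)             # otherwise start a new cluster
            return len(self.clusters) - 1

    clusterer = IncrementalClusterer(threshold=0.3)
    for post in ["flood warning downtown", "downtown flood alert", "concert tickets on sale"]:
        print(post, "-> cluster", clusterer.add(post))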

Many researchers have looked at the real-time analysis of big data streams, but not much attention has been directed towards social media stream preprocessing. For instance, social media streams are characterized by incomplete and noisy text, slang, and abbreviated words. Also, the contextual meaning of social media posts is essential for improving the quality and accuracy of event detection, sentiment analysis and other social media analytics algorithms [36, 39]. More attention therefore needs to be given to the preprocessing stage of social media stream analysis in the face of the incomplete, noisy, slang-laden, and abbreviated words that characterize social media streams. These challenges create opportunities for the application of new semantic technology approaches, which are better suited to social media streams [40, 41].

Research Question 3: What do big data streaming tools and technologies have in common, and how do they differ in terms of concept, purpose, and capabilities?

The features of various tools and technologies for big data streams were compared in order to answer this question. An overview analysis based on 10 dimensions (database support, execution model, workload, fault-tolerance, latency, throughput, reliability, operating system, implementation languages, and application domain or areas) is presented in Table 9.

Table 9 Comparison of big data streaming tools and technologies

Organisations with existing applications that rely on SQL, MySQL, SQL Server or Oracle Database, for instance, may consider choosing big data streaming tools and technologies that support their existing databases. There are a few big data streaming tools and technologies that support virtually any data format; an example is Infochimps Cloud.

The major big data streaming tools and technologies considered all support a streaming execution model; however, of the 19 big data tools and technologies compared and contrasted in this section, only 10.5% (2 of the 19) are suitable for streaming, batch, and iterative processing, while 47.4% (9 of the 19) can handle jobs requiring both batch and streaming processing. It is safer to execute a job on a single platform that can accommodate all the required dependencies, thereby avoiding interoperability constraints, than to combine two or more platforms or frameworks. The best fit with respect to the choice of big data streaming tools and technologies will depend on the state of the data to be processed, infrastructure preferences, the business use case, and the kind of results of interest.

Virtually all the big data streaming tools and technologies are memory intensive. This implies that the main performance bottleneck under higher load conditions will be the lack of memory [42]. However, research has shown that the benefit of memory-intensive applications outweighs the performance loss due to long memory latency [43].

Of all the big data streaming tools and technologies reviewed, only IBM InfoSphere and TIBCO StreamBase support all three of the “at-most-once”, “at-least-once” and “exactly-once” message delivery mechanisms, while the others support one or two of the three. “At-most-once” is the cheapest, with the least implementation overhead and the highest performance, because it can be done in a fire-and-forget fashion without keeping state in the transport mechanism or at the sending end. “At-least-once” delivery requires multiple attempts in order to counter transport losses, which means keeping state at the sending end and having an acknowledgement mechanism at the receiving end. “Exactly-once” is the most expensive and consequently has the worst performance because, in addition to the “at-least-once” delivery mechanism, it requires state to be kept at the receiving end in order to filter duplicate deliveries. In other words, the “at-most-once” delivery mechanism implies that a message may be lost, “at-least-once” delivery ensures that messages are not lost, and “exactly-once” implies that a message can neither be lost nor duplicated. “Exactly-once” is suitable for many critical systems where duplicate messages are unacceptable.
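To illustrate the difference between these guarantees, the toy Python sketch below (a model for exposition only, not the API of any of the platforms discussed) shows at-most-once as fire-and-forget, at-least-once as retry-until-acknowledged at the sending end, and an exactly-once effect obtained by deduplicating on a message identifier at the receiving end.

    # Toy model of message delivery semantics (illustrative, not a real platform's API).
    import random

    class Receiver:
        def __init__(self):
            self.seen_ids = set()     # state kept at the receiving end
            self.processed = []

        def deliver(self, msg_id, payload):
            if msg_id not in self.seen_ids:   # filter duplicate deliveries ("exactly-once" effect)
                self.seen_ids.add(msg_id)
                self.processed.append(payload)

    def unreliable_send(receiver, msg_id, payload, loss_rate=0.3):
        """Simulated lossy transport: the message or its acknowledgement may be lost."""
        if random.random() < loss_rate:
            return False                      # message lost in transit
        receiver.deliver(msg_id, payload)
        return random.random() >= loss_rate   # delivered, but the ack itself may be lost

    def send_at_most_once(receiver, msg_id, payload):
        unreliable_send(receiver, msg_id, payload)        # fire and forget: may be lost

    def send_at_least_once(receiver, msg_id, payload):
        while not unreliable_send(receiver, msg_id, payload):
            pass                              # retry until acknowledged: may deliver duplicates

    rx = Receiver()
    for i, data in enumerate(["a", "b", "c"]):
        send_at_least_once(rx, i, data)
    print(rx.processed)                       # each payload processed exactly once despite retries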

Research Question 4: What are the limitations and strengths of big data streaming tools and technologies?

Observations from the literature reveal that a specific big data streaming technology may not provide the full set of features required. It is rare to find a specific big data technology that combines key features such as scalability, integration, fault-tolerance, timeliness, consistency, heterogeneity and incompleteness management, and load balancing. For instance, Spark Streaming [16] and Sonora [44] are excellent and efficient at checkpointing, but the operator space available to user code is limited. S4 does not guarantee 100% fault-tolerant persistent state [45]. Storm does not guarantee the ordering of messages due to its “at-least-once” mechanism for record delivery [46, 47]. Strict transaction ordering is required for Trident to operate [48]. While streaming SQL provides simple and succinct solutions to many streaming problems, complex application logic (such as matrix multiplication) and intuitive state abstractions are better expressed with the operational flow of an imperative language than with a declarative language such as SQL [49,50,51].

Moreover, BlockMon uses batching and cache-locality optimization techniques for memory allocation efficiency and to speed up data access. However, deadlock may occur if data streams are enqueued at a higher rate than the blocks are consumed [52]. Apache Samza solves batch latency processing problems but requires an added layer for flow control [53]. Flink is suitable for heavy stream processing and batch-oriented tasks, although it has scaling limitations [46]. Redis’ in-memory data store makes it extremely fast, although this implies that the available memory determines the size of the Redis data store [54]. While C-SPARQL and CQELS are excellent for combining static and streaming data, they are not suitable when scalability is required [55]. SAMOA is suitable for the machine learning paradigm as it focuses on speed/real-time analytics, scales horizontally and is loosely coupled with its underlying distributed computation platform [56]. With the Lambda architecture, a real-time layer can complement the batch processing layer, thereby reducing maintenance overhead and the risk of errors resulting from duplicate code bases. In addition, the Lambda architecture handles reprocessing, which is one of the key challenges in stream processing. The two main problems with the Lambda architecture are maintaining code in two complex distributed systems that need to produce the same result, and high operational complexity [57, 58].

In summary, various tools and technologies exist for implementing big data streams, and at present no single big data streaming tool or technology seems to offer all the key features required. While each tool and technology has its strengths and weaknesses, the choice depends on the objective of the research and the availability of data. A decision in favour of the wrong technology may result in increased overhead cost and time. The decision should take empirical analysis into consideration alongside system requirements. In addition, research efforts should be directed at improving existing big data streaming tools and technologies to provide key features such as scalability, integration, fault-tolerance, timeliness, consistency, heterogeneity and incompleteness management, and load balancing.

Research Question 5: What are the evaluation techniques or benchmarks that are used for evaluating big data streaming tools and technologies?

The diversity of big data poses a challenge when it comes to developing big data benchmarks that are suitable for all workload cases. One cannot stick to a single big data benchmark, because it has been observed that using only one benchmark on different datasets does not give the same result. This implies that benchmark testing should be application-specific. Consequently, in evaluating a big data system, identifying the workload for an application domain is a prerequisite [59]. Most of the existing big data benchmarks are designed to evaluate a specific type of system or architecture. For instance, HiBench [60] is suitable for benchmarking Hadoop, Spark and streaming workloads, while GridMix [61] and PigMix [62] are for MapReduce Hadoop systems. BigBench [63, 64] is suitable for benchmarking Teradata Aster DBMS, MapReduce systems, the Redshift database, Hive, Spark and Impala. Presently, BigDataBench [65, 66] seems to be the only big data benchmark that can evaluate a hybrid of different big data systems.

So far, many researchers have evaluated their work using synthetic and real-life data; a standard benchmark dataset for big data streaming analytics has not been widely adopted. However, a few of the studies that used standardized benchmarking are briefly discussed below. The work of [67] was tested with two benchmarks, Word Count and Grep. The result showed that the proposed algorithm can effectively handle unstable input and that the total event delay can be limited to an expected range.

The tool developed by [68] was tested on both a car dataset and the Wikinews dataset, in comparison with sequential processing. It was found that their tool (a pipeline implementation) performed better and faster.

Krawczyk and Wozniak used several benchmark datasets, including Breast-Wisconsin, Pima, Yeast3, Voting records, CYP2C19 isoform and RBF, to compare their proposed method for estimating weights for newly incoming data streams against other standard methods. They also analysed time and memory requirements. The experimental investigation showed that the proposed method achieves better results [69].

A benchmark evaluation using an English movie review dataset collected from the Rotten Tomatoes website (a de facto benchmark for analysing sentiment applications) was conducted by [70]; the result showed that the sentiment analysis engine (SAE) proposed by the authors outperformed the bag-of-words approach.

The authors’ suite of ideas in [71] outperformed a state-of-the-art searching technique called EBSM. The work of [72] used various datasets such as KDD-Cup 99, Forest Cover Type, and Household Power Consumption, among others. They compared their algorithm, parallel k-means clustering, with k-means and k-means++; the results showed that their algorithm performed better in terms of speed.

Mozafari et al. [73] benchmarked their system, XSeq, against other general-purpose XML engines. The system outperformed other complex event processing engines by two orders of magnitude.

Authors in [74] evaluated their work in terms of time, accuracy and memory using the Forest Cover Type, Poker Hand, and Electricity datasets. They compared their method, adaptive windowing based online ensemble (AWOE), with other standard methods such as accuracy updated ensemble (AUE), online accuracy updated ensemble (OAUE), accuracy weighted ensemble (AWE), dynamic weighted majority (DWM) and Lev Bagging (Lev). Their proposed approach outperformed the other methods from three perspectives: suitability for different types of drift, better resolution of the appropriate block size, and efficiency.

The evaluation performed by [75] using the FACup and Super Tuesday datasets showed that their method, a hybrid of topic extraction methods (i.e. a combination of feature-pivot and document-pivot approaches), has high efficiency and accuracy with respect to recall and precision.

To evaluate the performance of the low-rank reconstruction and prediction scheme, specifically singular spectrum matrix completion (SS-MC), proposed by [76], the SensorScope Grand St-Bernard dataset and the Intel Berkeley Research Lab dataset were used. The authors compared their proposed method with three state-of-the-art methods (KNN-imputation, RegEM and the ADMM version of MC) and found that their method outperformed the others in pure reconstruction as well as in the demanding case of simultaneous recovery and prediction.

The authors in [77] evaluated their work using the World Cup 1998 and CAIDA Anonymized Internet Traces 2011 datasets. When their method, ECM-Sketch (a sketch synopsis that allows effective summarization of streaming data over both time-based and count-based sliding windows), was compared with three state-of-the-art sketch variants (ECM-RW, ECM-DW, and ECM-EH, which use randomized waves, deterministic waves and exponential histograms respectively), their method reduced memory and computational requirements by at least one order of magnitude with a very small loss in accuracy.

The work of [78] centred on benchmarking real-time vehicle data streaming models for a smart city using a simulator that emulates the data produced by a given number of simultaneous drivers. Experiments with the simulator showed that a stream processing engine such as Apache Kafka could replace custom-made streaming servers, achieving low latency and higher scalability together with cost reduction.

A benchmark among Kyvos Insights, Impala and Spark conducted by [79] shows that Kyvos Insights performed analytical queries with much lower latencies when there was a large number of concurrent users, owing to pre-aggregation and incremental cube building [80].

Authors in [81] proposed that, in addition to execution time and resource utilization, microarchitecture-level metrics and energy consumption are key to fully understanding the behaviour of big data frameworks.

In addition, to strengthen confidence in big data research evaluations and results, the application of empirical methods (i.e. testing or evaluating a concept or technology to obtain evidence-based results) should be highly encouraged. The current status of empirical research in big data stream analysis is still at an infant stage. The maturity of a research field is directly proportional to the number of publications with empirical results [20, 21]. According to [21], which conducted a systematic literature mapping to verify the current status of empirical research in big data, only 151 out of 1778 studies contained empirical results. As a result, more effort should be directed at empirical research in order to raise the level of confidence in big data research outputs above its present level.

Moreover, only a few big data benchmarks are suitable for different workloads at present. Research efforts should be geared towards advancing benchmarks that are suitable for evaluating different big data systems. This would go a long way towards reducing cost and interoperability issues.

Discussion

From the analysis, it was observed that there has been a wave of interest in big data stream analysis since 2013. The number of papers produced in 2012 doubled in 2013, and more than doubled again from 2013 to 2014. There was a relative surge in 2017, with a total of 98 papers, while the year 2018 recorded 156 papers (see Tables 9, 10 and Fig. 2). The percentage of papers analysed from journals was 50%, that from conferences was 41%, while that from workshops/technical reports/symposia was 9%, as depicted in Fig. 3. Figure 4 presents the frequency of research efforts from different geographical locations, with researchers from China taking the lead.

Table 10 Distribution of papers over the studied years
Fig. 2

Magnitude of change in paper distribution. The figure shows the magnitude of change in paper distribution over the studied years (i.e. 2004 to 2018)

Fig. 3

Percentage of publication type. The figure shows the percentage of the 381 papers from journals (50%), conferences (41%), and workshops/technical reports/symposia (9%)

Fig. 4

Frequency of researchers across different geographical locations. The figure presents the frequency of research efforts from different geographical locations

The selection of big data streaming tools and technologies should be based on the importance of each of the factors discussed, such as the shape of the data, data access, availability and consistency requirements, the workload profile required, and the latency requirement. Careful selection of open source technology must be made, especially when choosing a recent technology still in production. Moreover, the problem to be addressed and an understanding of the true costs and benefits of both open and proprietary solutions are also vital when making a selection.

A lot of research effort has been directed at big data stream analysis, but social media stream preprocessing is still an open issue. Due to the inherent characteristics of social media streams, which include incomplete and noisy text, slang, and abbreviated words, they present a challenge to big data stream analytics algorithms. More attention needs to be given to the preprocessing stage of social media stream analysis in order to improve big data stream analytics results.

Out of the 19 big data streaming tools and technologies compared, 100% support streaming, 47.4% can do both batch and streaming processing, while only 10.5% support streaming, batch and iterative processing. Depending on the state of the data to be processed, infrastructure preferences, the business use case, and the kind of results of interest, choosing a single big data streaming technology platform that supports all the system requirements minimizes the effect of interoperability constraints.

Of all the big data streaming tools and technologies reviewed, only IBM InfoSphere and TIBCO StreamBase support all three of the “at-most-once”, “at-least-once”, and “exactly-once” message delivery mechanisms, while the others support one or two of the three. Having all three delivery mechanisms gives room for flexibility.

It is rare to find a specific big data technology that combines key features such as scalability, integration, fault-tolerance, timeliness, consistency, heterogeneity and incompleteness management, and load balancing. There seems to be no big data streaming tool and technology that offers all the key features required for now. This calls for more research efforts that are directed to building more robust big data streaming tools and technologies.

Few big data benchmarks are suitable for a hybrid of big data systems at present and standard benchmark datasets for big data streaming analytics have not been widely adopted. Hence, research efforts should be geared towards advancing benchmarks that are suitable for evaluating different big data systems.

Limitation of the review

While authors explored Scopus, ScienceDirect and EBSCO databases which index high impact journals and conference papers from IEEE, ACM, SpringerLink, and Elsevier to identify all possible relevant articles, it is possible that some other relevant articles from other databases such as Web of Science could have been missed.

The analysis and synthesis are based on the research team’s interpretation of the selected articles. The authors attempted to mitigate bias by cross-checking papers, though this cannot completely rule out the possibility of errors. In addition, the authors applied the inclusion and exclusion criteria in the selection of articles, and only relevant articles written in the English language were selected. Building on the findings of the research, while a lot of work has been done with respect to the tools and technologies as well as the methods and techniques employed in big data streaming analytics, methods of evaluation or benchmarks of these technologies across various workloads have not received much attention. As could be gathered from the literature reviewed, most researchers evaluated their work using either synthetic or real-life datasets.

Conclusion and further work

As a result of the challenges and opportunities presented by the information technology revolution, big data streaming analytics has emerged as the new frontier of competition and innovation. Organisations that seize the opportunity of big data streaming analytics gain insights for robust decision-making in real time, thereby giving them an edge over their competitors.

In this paper, the authors have presented a holistic view of big data streaming analytics by conducting a comprehensive literature review to understand and identify the tools and technologies, methods and techniques, benchmarks or methods of evaluation employed, and the key issues in big data stream analysis, and to signpost future research directions.

Although a lot of research efforts have been directed towards big data at rest (i.e. big data batch processing), there has been increased interest in analysing big data in motion (i.e. big data stream processing). With respect to issues identified in this paper, big data streaming analytics can be considered as an emerging phenomenon although some countries and industries have seized the opportunities by making it a pertinent research area. Some of the key issues such as scalability, integration, fault-tolerance, timeliness, consistency, heterogeneity and incompleteness, load balancing, high throughput, and privacy that require further research attention were identified. While researchers have invested a lot of efforts to mitigate these issues, scalability, privacy and load balancing remain a concern. In addition, researchers also need to give more focus to the empirical analysis of big data streaming tools and technologies in order to be able to provide concrete reasons and support for choosing a tool/technology based on empirical evidence.

Presently, BigDataBench seems to be the only big data benchmark that can evaluate a hybrid of different big data systems. Standard benchmark for a hybrid of big data systems has not been widely adopted. It is rare to find a specific big data technology that combines key features such as scalability, integration, fault-tolerance, timeliness, consistency, heterogeneity and incompleteness management, and load balancing.

There is the need to give more attention to the preprocessing stage of social media stream analysis in the face of incomplete, noisy, slang, and abbreviated words that are pertinent to social media streams. Many researchers have looked at the aspect of the real-time analysis of big data streams but not much attention has been directed towards social media stream preprocessing.

In addition, research efforts should be geared towards developing scalable frameworks and algorithms that accommodate the data stream computing mode, effective resource allocation strategies and parallelization, to cope with the ever-growing size and complexity of data. As regards load balancing, a distributed environment that automatically streams partial data streams to a global centre when local resources become insufficient is required. The demand of big data stream analysis that data must be analysed as soon as they arrive makes privacy a big concern. The main challenge here is proposing techniques for protecting a big data stream dataset before its analysis in such a way that real-time analysis is still maintained. As a result, research efforts should be directed at the identified areas in order to arrive at robust solutions for big data streaming analytics.