The stream processing model expects data to arrive in real time and, therefore, the system has no control over the order or the frequency in which samples arrive. Samples arrive in a continuous and unbounded way, and the size and type of the data are unknown. In real-time stream processing, each sample is processed only once, since the system must remain available for new data [56]. Samples can be temporarily stored in memory, but memory is small compared to the potential size of the data arriving in the streams. Thus, to achieve real-time data processing, stream processing imposes restrictions on the processing time per sample because, if the data arrival rate is greater than the processing service rate, the queue of data waiting to be processed grows indefinitely and, as a consequence, data is discarded. Therefore, one approach to provide efficient stream processing is to use distributed processing platforms and approximation techniques to speed up data processing.
Traditional machine learning methods are based on systems that assume that all collected data is fully loaded as a single batch and, thus, can be centrally processed [77]. As the volume of data increases, however, existing machine learning techniques fail when faced with unexpected volumes of data and with the requirement to return the processing output as quickly as the data is generated. Thus, there is a need for developing new machine learning methods with faster response and adaptive behavior to meet the demands of processing big data in real time [56].
In many cases, certain patterns and behaviors are lost or hidden in the middle of a large volume of data. Machine-learning-based systems help to discover this lost or hidden information. This is possible because, when new information becomes available, decision structures are reviewed and updated. Several models update their parameters considering one sample at a time: incremental learning, online learning, and sequential learning models [78]. Incremental methods do not have time or sample-order restrictions, while online methods require samples to be processed in order and only once, according to their time of arrival. Many incremental algorithms can be used in an online manner, but algorithms intended to model a behavior over time require samples to be in order.
In dynamic and non-stationary environments, the distribution of the data may change over time, producing the phenomenon of concept drift. Concept drift refers to changes in the conditional distribution of the output, that is, the probability of belonging to a target class varies, given the vector of input features, while the distribution of the features remains unchanged [78]. An example of concept drift is a change in customer consumption patterns. The buying preferences of a customer may change over time, depending on the day of the week, the availability of products, or salary changes. In the security area, models to detect threats become obsolete with minimal variations in the composition of the attacks [79]. Concept drift affects the performance of most learning algorithms, making them less accurate over time. An effective predictor must be able to track these changes and quickly adapt to them. A hard problem when dealing with concept drift is to distinguish between noise and an actual change. Some algorithms react excessively to noise, misinterpreting it as concept drift, while others are highly robust to noise and adjust to changes very slowly. The four known types of concept drift are shown in Fig. 4: (i) sudden or abrupt change; (ii) incremental change; (iii) gradual change; and (iv) recurring or cyclical change.
Different approaches to detect concept drift can be used depending on the classification domain [10]. The first and simplest assumes that the data is static and, therefore, that there is no change in the distribution of the data; the model is trained only once and the same model is used for future data. Another approach is to periodically update the static model with more recent historical data, which is also known as incremental learning. Some machine learning algorithms, such as regression algorithms or neural networks, make it possible to weigh the importance of input data. In these algorithms, it is possible to use a weighting inversely proportional to the age of the data, so that more recent data is more important, with greater weight, and older data is less important, with smaller weight. Another approach is to use ensemble algorithms, such as AdaBoost or Random Forest. In this case, the static model remains intact, but a new model learns to correct the predictions of the static model based on the most recent data relationships. Finally, it is also possible to detect concept drift using heuristics or intrinsic data statistics. Heuristics based on metrics such as accuracy or precision are mainly used in supervised learning scenarios, in which data labels are available during training and classification. However, the presence of labels during classification is not usual in a production environment. In the unsupervised learning scenario, incoming samples are statistically compared, or clustered, with the samples used to train the system, and a concept drift is detected whenever new groups are found [80]. These methods of detecting changes tend to be more computationally intensive, since distance-based measures are computed over the obtained samples.
Techniques for mining data streams
Gaber et al. categorize solutions for handling streams as data-based or task-based [81]. Data-based solutions aim to reduce the data representation, either through horizontal transformations, which decrease the number of features to handle, or through vertical transformations, which select a subgroup of samples to handle, also known as sampling. Task-based solutions focus on deploying computational techniques that find efficient solutions in terms of both time and storage space, while data-based techniques rely on summarizing data or choosing subsets of data from the input stream. Some of the data-based solutions are summarized as follows.
Data Sampling represents data samples as tuples, which are either selected for processing or discarded at random. When data arrives at a rate higher than the system can process promptly, sampling reduces the effective arrival rate by discarding tuples. A possible usage scenario for data sampling is the collection of data on a high-speed network: instead of processing every single data sample, an approximate result is computed over the collected data, without indefinitely increasing the queue of pending samples and with fewer resource constraints than the complete operation [56]. The classic algorithm for maintaining an online random sample is the reservoir sampling technique. The algorithm maintains a sample of size s, called a reservoir. As new data arrives, each new element has a certain probability of replacing an old element in the reservoir. An extension of this algorithm is to keep a sample of size k of the most recent data over a sliding window of size n.
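To make the technique concrete, the following is a minimal Python sketch of reservoir sampling (the stream source and variable names are illustrative, not taken from the original works): the first s elements fill the reservoir, and the i-th subsequent element replaces a random slot with probability s/i, so every element of the stream ends up in the reservoir with equal probability.

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of size s over an unbounded stream."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)          # fill the reservoir with the first s items
        else:
            j = random.randint(1, i)        # item i replaces a slot with probability s/i
            if j <= s:
                reservoir[j - 1] = item
    return reservoir

# Example: keep a sample of 10 elements from a stream of one million integers.
sample = reservoir_sample(range(1_000_000), 10)
```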
Load Shedding refers to the process of discarding sequences of data from the stream when the input rate exceeds the processing capacity of the system. Thus, the system achieves an adaptive behavior to meet latency requirements. This technique also causes loss of information. It generally applies to dynamic query systems. In data mining, load shedding is difficult to use, as discarding blocks of the stream can lead to the loss of data that is useful for building models. It can also discard patterns of interest in a time series.
Sketching is the process of randomly projecting a subset of the attributes, or of the domain, of the data. The key idea is to produce faster results with mathematically proven error bounds. The sketch performs a vertical sampling of the data arriving as a stream, excluding attribute columns. Sketching’s main disadvantage is the loss of accuracy, because the results are an approximation. As an alternative to sketching in machine learning applications over data streams, there is PCA, in which, instead of using a subset of the attributes, a linear combination of attributes reduces the data dimensionality while maximizing the data variance.
Synopsis Structures are in-memory data summary structures. The key idea is to generate an approximate result while reducing memory complexity. The hash sketch proposal, for instance, creates a bit vector of size L, where \(L=\log_{2}(N)\) and N is the number of data samples. Let lsb(y) be the function that denotes the position of the least significant bit 1 in the binary representation of y. The incoming data x is mapped into a position of the bit vector using a uniform hash(x) function and the function lsb(hash(x)), which marks in the bit vector the occurrence of the data sample. From this proposal, if R is the position of the rightmost zero in the bit vector, its expected value is \(E[R]=\log_{2}(\phi d)\), where ϕ≈0.7735, and thus the number of distinct elements can be estimated as \(d=2^{R}/\phi\). The use of histograms to estimate the relative frequency of samples in streams also constitutes a synopsis data structure.
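As an illustration of the hash-sketch idea, the following minimal Python sketch (the MD5-based hash and all names are assumptions made only for illustration) marks lsb(hash(x)) in a bit vector and estimates the number of distinct elements as \(2^{R}/\phi\). A single sketch gives a coarse, noisy estimate; practical schemes average several such sketches.

```python
import hashlib

def lsb(y):
    """Position (0-based) of the least significant 1 bit in the binary representation of y."""
    return (y & -y).bit_length() - 1

def fm_estimate(stream, L=32):
    """Estimate the number of distinct elements in a stream with one hash-sketch bit vector."""
    bits = [0] * L
    for x in stream:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16)  # uniformly distributed hash (assumed)
        pos = lsb(h)
        if 0 <= pos < L:
            bits[pos] = 1
    R = bits.index(0) if 0 in bits else L   # least-significant zero bit ("rightmost zero" above)
    phi = 0.77351
    return (2 ** R) / phi

# Example: the estimate is of the order of the 5000 distinct values in the stream.
print(fm_estimate(i % 5000 for i in range(100_000)))
```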
Aggregation is the technique of summarizing incoming data. The summary may take the form of the average, variance, maximum, or minimum. The memory cost of this technique is very low, but the technique fails if the data stream varies widely. Two noteworthy aggregation techniques are the recursive average calculation, given by
$$ \overline{x_{i}} = \frac{(i-1)\times\overline{x_{i-1}}+x_{i}}{i}, $$
(9)
where \(\overline{x_{i}}\) is the current average after i samples and \(x_{i}\) is the i-th sample, and the recursive standard deviation calculation, given by
$$ \sigma_{i} = \sqrt{\left(\sum x^{2}_{i} - \left(\sum x_{i}\right)^{2}/i\right)/(i-1)}. $$
(10)
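A minimal Python sketch of these two aggregation formulas follows, keeping only the running counters needed by Eqs. 9 and 10 (class and variable names are illustrative).

```python
import math

class RunningStats:
    """Incremental mean (Eq. 9) and standard deviation (Eq. 10) over a stream."""
    def __init__(self):
        self.i = 0          # number of samples seen
        self.mean = 0.0     # running average
        self.sum_x = 0.0    # running sum of x
        self.sum_x2 = 0.0   # running sum of x^2

    def update(self, x):
        self.i += 1
        self.mean = ((self.i - 1) * self.mean + x) / self.i        # Eq. 9
        self.sum_x += x
        self.sum_x2 += x * x

    def std(self):
        if self.i < 2:
            return 0.0
        return math.sqrt((self.sum_x2 - self.sum_x ** 2 / self.i) / (self.i - 1))  # Eq. 10

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean, stats.std())
```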
The wavelet transform represents a signal as a sum of simpler, well-defined waveforms at different scales and positions. It aims to capture trends in numerical functions, decomposing the signal into a set of coefficients. Similarly to feature reduction using PCA, the lowest-order coefficients can be discarded. The reduced set of coefficients is then used in algorithms that operate on the data stream. The transform most used in the context of data stream processing algorithms is the Haar wavelet.
Task-based data processing techniques are methods that modify existing techniques, or create new ones, to meet the computational challenge of stream processing. The key task-based solutions follow.
Approximation algorithms return approximate computation results limited by error thresholds [81]. In streaming data mining, in particular, algorithms with approximate results are commonplace, as results are expected to be generated continuously, quickly, and with limited computational resources.
Time windows are commonly used to resolve queries over stream data with an undefined end. Instead of performing a computation over the complete data, the computation runs over a subset of the data, possibly more than once over the same data. In this model, a timestamp is associated with each incoming data sample. The timestamp defines whether a data sample is inside or outside the window being considered. The computation is performed only on the data samples that are within the considered window. Some alternative time window approaches are the landmark window, the hopping window, the sliding window, and the tilted window.
The landmark window identifies relevant points, the landmarks, in the data stream and, from there on, computes the aggregation operators starting at each landmark. Successive windows share the same beginning and are of increasing size. A landmark may be, for instance, the start of a new day for daily data aggregation. The hopping window is a structure that considers a certain fixed number of samples and, when a subsequent set with a sufficient number of samples arrives, the previous samples are discarded and the computation is done on the new sample set. The hopping window uses each fixed-size sample set only once, and each sample belongs to only one set. The sliding window, in turn, is a fixed-size data structure that inserts the most recent samples and removes the oldest ones, similarly to a first-in, first-out model. Such a structure is computationally interesting since, in most cases, the whole past is not as relevant as the recent past. Thus, the sliding window considers only a fixed number of samples in the most recent past. The sliding window structure is widely used for calculating moving averages. Finally, the tilted window creates different resolutions for aggregating the data. Unlike the previous windows, where samples are either inside the window or outside it, in the tilted window the most recent samples are treated with fine granularity, while samples from the past are grouped with lower granularity. The farther in the past, the coarser the grouping granularity of the samples.
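As a simple illustration of the sliding window, the following Python sketch (names are illustrative) keeps a fixed-size first-in, first-out window and returns the moving average after each new sample.

```python
from collections import deque

class SlidingWindowAverage:
    """Fixed-size sliding window: newest samples enter, oldest leave (FIFO)."""
    def __init__(self, size):
        self.window = deque(maxlen=size)   # deque discards the oldest item automatically

    def update(self, x):
        self.window.append(x)
        return sum(self.window) / len(self.window)   # moving average over the current window

avg = SlidingWindowAverage(size=3)
for x in [1, 2, 3, 4, 5]:
    print(avg.update(x))    # 1.0, 1.5, 2.0, 3.0, 4.0
```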
Online machine learning methods
A first approach for handling streaming big data is the use of learning methods capable of learning from infinite data in finite time. The central idea is to apply learning methods that limit the loss of information incurred by models built on finite data with respect to models built on infinite data. For this purpose, the loss of information is measured as a function of the number of samples used in each learning step, and then the number of samples used in each step is minimized while maintaining the loss threshold. The question of how much information is lost when decreasing the number of samples is answered using the Hoeffding bound [78]. Considering a random variable x whose values are contained in an interval of size R, it is assumed that independent observations of the variable are made and that the mean \(\bar {x}\) is computed. The Hoeffding bound ensures that, with probability 1−δ, the true mean of the variable is at least \(\bar {x}-\epsilon \), where
$$ \epsilon = \sqrt{\frac{R^{2}\ln(1/\delta)}{2n}}. $$
(11)
The Hoeffding bound is independent of the distribution that generates the variable x. From this result, machine learning algorithms for training on stream data are developed. It is worth mentioning, however, that the values of the variable x are assumed to come from a stationary stochastic process. When there is a change in the process that generates the variable used to train stream learning methods, a concept drift has occurred and, therefore, the learning method must be retrained [82].
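For illustration, the following Python sketch (function names are illustrative) evaluates Eq. 11 directly and also inverts it to obtain the number of samples needed to guarantee a desired error bound.

```python
import math

def hoeffding_epsilon(R, delta, n):
    """Eq. 11: error bound epsilon after n independent observations of a variable with range R."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

def samples_needed(R, delta, epsilon):
    """Invert Eq. 11: minimum n so that the true mean lies within epsilon with probability 1 - delta."""
    return math.ceil((R ** 2) * math.log(1.0 / delta) / (2.0 * epsilon ** 2))

# Example: a quantity with range R = 1, 95% confidence.
print(hoeffding_epsilon(R=1.0, delta=0.05, n=1000))     # ~0.0387
print(samples_needed(R=1.0, delta=0.05, epsilon=0.01))  # ~14979 samples
```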
Typical approaches to learning new information either assume that the statistical behavior of the data is maintained or discard the existing classifier and, consequently, retrain it with the data accumulated so far. Approaches that consider the end of the statistical stability of the data, i.e., that the process is no longer stationary, result in the loss of all previously acquired information. This phenomenon is known as catastrophic forgetting. Polikar et al. define that incremental learning algorithms must satisfy the following requirements: obtaining additional information from new data; not requiring access to the original data used to train the existing classifier; preserving previously acquired knowledge, i.e., not suffering catastrophic forgetting; and accommodating new classes that new data may introduce. Thus, classifiers that adopt incremental learning do not require retraining the entire classifier when the stationary behavior of the streaming data changes.
Incremental decision tree online algorithms
These algorithms are divided into two categories: (i) trees built using a greedy search algorithm, in which the addition of new information involves the complete restructuring of the decision tree, and (ii) incremental trees that maintain a set of sufficient statistics at each node of the tree and perform a node split test, making the classification more specific, when the accumulated statistics at the node favor the split. An example of this type of incremental tree is the Very Fast Decision Tree (VFDT) algorithm [83]. The purpose of VFDT is to provide a decision tree learning method for extremely large, potentially infinite datasets. The central idea is that each sample is read only once and processed in a short time, which makes it possible to directly handle online data sources without storing samples. To find the best attribute to test at a given node, it may be sufficient to consider only a small subset of the training samples that pass through that node. Thus, given a stream of samples, the first samples are used to choose the root test; once the root attribute is chosen, subsequent samples are forwarded to the corresponding leaf nodes and used to choose the appropriate attributes at those nodes, and so on, recursively.
In a VFDT system, the decision tree is learned recursively, replacing leaves with decision nodes. Each leaf stores sufficient statistics on attribute values. Sufficient statistics are those required by a heuristic evaluation function that assesses the merit of split tests based on attribute values. When a sample arrives, it traverses the tree from the root to a leaf, evaluating the appropriate attribute at each node and following the branch corresponding to the attribute value in the sample. When the sample reaches the leaf, the statistics are updated. Then, the possible split conditions based on the attribute values are evaluated. If there is sufficient statistical support in favor of a split test on one attribute over the others, the leaf is converted into a decision node. The new decision node has as many descendants as there are values of the chosen decision attribute. Decision nodes maintain only information about the split test installed on the node. The initial state of the tree consists of a single leaf, which is the root of the tree. The heuristic evaluation function is the Information Gain, denoted by H(·). The statistics sufficient to estimate the merit of a nominal attribute are the counters nijk, which represent the number of examples of class k that reach the leaf with attribute j taking value i. The information gain measures the amount of information needed to classify a sample arriving at the node: H(Aj)=info(samples)−info(Aj). The information of attribute j is given by
$$ info(A_{j}) = \sum_{i} P_{i} \left(\sum_{k} - P_{ik}log_{2} (P_{ik}) \right), $$
(12)
where \(P_{ik} = \frac {n_{ijk}}{\sum _{a} n_{ajk}}\) is the probability of observing value i of the attribute given the class k, and \(P_{i} = \frac {\sum _{a} n_{ija}} {\sum _{a}\sum _{b}n_{ajb}}\) is the probability of observing value i of the attribute.
In the VFDT system, the Hoeffding bound given in Eq. 11 is used to decide how many samples must be observed before installing a split test at each leaf. Let H(·) be the attribute evaluation function; for the information gain, the range R of H(·) is log2(|K|), where K is the set of classes. Let xa be the attribute with the highest value of H(·), xb the attribute with the second-highest value of H(·), and \(\overline {\Delta H}=\overline {H}(x_{a}) - \overline {H}(x_{b})\) the difference between the two best attributes. Hence, if \(\overline {\Delta H} > \epsilon \), with n samples observed at the leaf, the Hoeffding bound guarantees with probability 1−δ that xa is the attribute with the highest value of the evaluation function. Thus, the leaf must be transformed into a decision node that splits on xa.
Evaluating the function for each sample can be costly and, therefore, it is not efficient to compute H(·) on the arrival of each new sample. The VFDT proposal only computes the attribute evaluation function when a minimum number of samples, defined by the user, has been observed since the last evaluation. When two or more attributes continuously have very similar values of H(·), even with a large number of samples, the Hoeffding bound cannot decide between them. In this case, VFDT introduces a constant τ: when \(\overline {\Delta H} < \epsilon < \tau \), the leaf is converted into a decision node and the decision test is based on the best attribute. Gama et al. generalize the functioning of the VFDT system for numeric attributes [78].
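The following simplified Python sketch, a didactic approximation rather than the original VFDT implementation (class name, default parameters, and data layout are illustrative assumptions), shows how a leaf can maintain the nijk counters, evaluate Eq. 12, and apply the Hoeffding test with the tie-breaking constant τ.

```python
import math
from collections import defaultdict

class HoeffdingLeaf:
    """Simplified leaf of a Hoeffding tree: keeps n_ijk counters and decides when to split."""
    def __init__(self, n_attributes, n_classes, delta=1e-6, tau=0.05):
        # counts[j][(i, k)] = number of samples reaching the leaf with attribute j = value i and class k
        self.counts = [defaultdict(int) for _ in range(n_attributes)]
        self.n = 0
        self.n_classes = n_classes
        self.delta, self.tau = delta, tau

    def update(self, x, y):
        self.n += 1
        for j, value in enumerate(x):
            self.counts[j][(value, y)] += 1

    def info(self, j):
        """Eq. 12: expected class entropy after splitting on attribute j."""
        by_value = defaultdict(lambda: defaultdict(int))
        for (i, k), c in self.counts[j].items():
            by_value[i][k] += c
        total = sum(sum(v.values()) for v in by_value.values())
        result = 0.0
        for class_counts in by_value.values():
            n_i = sum(class_counts.values())
            entropy = -sum((c / n_i) * math.log2(c / n_i) for c in class_counts.values())
            result += (n_i / total) * entropy
        return result

    def should_split(self):
        """Hoeffding test: split when the best attribute beats the second best by more than epsilon."""
        if self.n == 0 or len(self.counts) < 2:
            return None
        infos = sorted((self.info(j), j) for j in range(len(self.counts)))
        delta_h = infos[1][0] - infos[0][0]          # gain difference between the two best attributes
        R = math.log2(self.n_classes)                # range of the information gain
        epsilon = math.sqrt(R * R * math.log(1 / self.delta) / (2 * self.n))
        if delta_h > epsilon or epsilon < self.tau:  # confident winner, or tie broken by tau
            return infos[0][1]                       # attribute index to install as the decision test
        return None
```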
Incremental Naive Bayes
Given a training set χ=(x1,y1),⋯,(xN,yN), where \(\mathbf {x} \in \mathbb {R}^{D}\) are D-dimensional samples in the attribute space and y∈{1,⋯,K} are the corresponding classes of a classification problem, Bayes’ Theorem is formulated as
$$ p(\mathit{y}=i|x) = \frac{p(i)p(\mathbf{x}|i)}{p(\mathbf{x})}, $$
(13)
where p(i) is the a priori probability of a sample of class i occurring, and p(x|i) is the unknown probability distribution over the attribute space x conditioned on class i. An estimate of the unknown distribution is obtained by assuming the independence of the attributes given the class label, leading to
$$ p(x_{1},x_{2},\cdots,x_{D}|i) \approx p(x_{1}|i)p(x_{2}|i) \cdots p(x_{D}|i), $$
(14)
where xd represents the d-th dimension of the attribute vector x. Thus, the Bayesian classifier is described as
$$ F(\mathbf{x}) = \underset{i}{\arg\max}~p(i)\prod_{d=1}^{D} p(x_{d}|i). $$
(15)
Hence, the classification is computed by multiplying, for each class, its prior probability by the probabilities of each attribute value of the sample, and selecting the class with the highest product [84].
The incremental version of the Bayesian classifier updates the class and attribute probability estimates as new samples are processed. One approach that allows the probability functions to be stored efficiently as samples arrive is to discretize the attributes and store attribute histograms [10]. The Incremental Flexible Frequency Discretization (IFFD) proposal presents a method for discretizing quantitative attributes into a sequence of flexible-size intervals [85]. This approach allows the insertion and splitting of intervals.
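A minimal Python sketch of the incremental update and classification steps for categorical attributes follows (Laplace smoothing and all names are illustrative assumptions; the IFFD discretization of numeric attributes is omitted). Each new sample only increments counters, so the model is updated one sample at a time.

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Naive Bayes with counters updated one sample at a time (categorical attributes)."""
    def __init__(self, n_attributes):
        self.class_counts = defaultdict(int)                                 # N(y = k)
        self.value_counts = [defaultdict(int) for _ in range(n_attributes)]  # N(x_d = v, y = k)
        self.n = 0

    def update(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for d, v in enumerate(x):
            self.value_counts[d][(v, y)] += 1

    def predict(self, x):
        best_class, best_score = None, float("-inf")
        for k, n_k in self.class_counts.items():
            score = n_k / self.n                                              # prior p(k)
            for d, v in enumerate(x):
                score *= (self.value_counts[d][(v, k)] + 1) / (n_k + 2)       # smoothed p(x_d | k)
            if score > best_score:
                best_class, best_score = k, score
        return best_class

nb = IncrementalNaiveBayes(n_attributes=2)
for x, y in [(("sunny", "hot"), 0), (("rain", "cool"), 1), (("sunny", "mild"), 0)]:
    nb.update(x, y)
print(nb.predict(("sunny", "cool")))
```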
Incremental Learning by Classifier Aggregates
The Adaptive Resonance Theory Mapping (ARTMAP) algorithm [86] is based on the generation of new decision clusters in response to new patterns that are sufficiently different from previously seen instances. The difference between an already known pattern and a new one is controlled by a user-defined vigilance parameter. Each cluster learns a hyper-rectangle covering a different portion of the feature space, in an unsupervised way, which is then mapped to target classes. Clusters are always retained by ARTMAP to avoid catastrophic forgetting. In addition, ARTMAP does not require access to previously seen data and can accommodate new classes. However, ARTMAP is very sensitive to the selection of the vigilance parameter, to the noise level in the training data, and to the order in which the training data arrives.
The AdaBoost algorithm generates a set of hypotheses and combines them through weighted majority voting of the classes predicted by each individual hypothesis [87]. The hypotheses are generated by training a weak classifier using instances drawn from a periodically updated distribution of the training data. This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier. Thus, the training data of consecutive classifiers focuses on instances that are increasingly difficult to classify.
The Learn++ incremental learning algorithm is inspired by AdaBoost and was originally developed to improve the classification performance of weak classifiers [88]. In essence, Learn++ generates a set of weak classifiers, each trained with a different distribution of training samples. The outputs of these classifiers are then combined through majority voting to obtain the final classification rule. The use of weak classifiers is interesting because the instability in building their decisions is sufficient for each decision to differ from the others, so that small changes in their training datasets are reflected in different classifications.
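The combination step shared by these ensembles can be sketched as a weighted majority vote, as in the following illustrative Python fragment; the weight log((1−e)/e) follows the AdaBoost formulation, and the weak hypotheses and their error values are hypothetical.

```python
import math
from collections import defaultdict

def weighted_majority_vote(classifiers, errors, x):
    """Combine weak hypotheses: each votes with weight log((1 - error) / error)."""
    votes = defaultdict(float)
    for clf, err in zip(classifiers, errors):
        weight = math.log((1.0 - err) / err)   # low-error hypotheses get larger voting weight
        votes[clf(x)] += weight
    return max(votes, key=votes.get)

# Example with three hypothetical weak hypotheses and their training errors.
h1 = lambda x: 0 if x < 5 else 1
h2 = lambda x: 0 if x < 3 else 1
h3 = lambda x: 1
print(weighted_majority_vote([h1, h2, h3], [0.10, 0.25, 0.45], 4))   # class 0 wins
```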
Incremental K-Means
The k-means clustering algorithm performs an iterative optimization using the sum of the squared distances of all points to the center of their clusters. In an unsupervised clustering problem, the input is defined as a set of N samples \( \overline {x}_{1}, \overline {x}_{2},\cdots, \overline {x}_{N}\), where \(\overline {x}_{i} \in R^{n}\). The problem then is to find K≪N clusters c1,c2,⋯,cK with centers \(\overline {\omega }_{1}, \overline {\omega }_{2}, \cdots, \overline {\omega }_{K}\), respectively, such that the sum D of the squared distances from the samples to the centers of the clusters to which they belong is minimal. The distance D is equivalent to the Mean Square Error (MSE) and is given by
$$ D=\frac{1}{N}\sum^{K}_{j=1}\sum_{\overline{x} \in c_{j}}(\overline{x}-\overline{\omega}_{j})^{2}. $$
(16)
The iterations of the classic algorithm consist of an initialization, in which K centers are chosen and the elements are classified by the nearest-neighbor rule. Afterward, the centers of the clusters are updated by computing the centroid of each cluster. In the following iterations, the data is reclassified according to the newly computed centroids. The iterations are repeated until the convergence of the algorithm, which is achieved when the cluster centers computed in iteration i are identical to those of iteration i+1.
The adaptation of the k-means algorithm for the sequential treatment of data consists of recalculating the cluster centers whenever a new sample arrives. This requires all previously processed data to be accessed again to recalculate the cluster centers, which generates a substantially high demand for computational resources and makes its use unfeasible. The variation of the algorithm using sequential blocks, in turn, enables the online use of k-means, since the clustering is performed on accumulated blocks of data. Each block is used in l iterations of the k-means algorithm, and the cluster centers resulting from block i are used as the initial centers for the iterations over block i+1. The incremental variation of the algorithm specifies that each block is used only once [89]. The validity of the result of the incremental algorithm lies in the fact that the probability distribution of the sample attributes does not change, or changes slowly, between blocks.
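The block-sequential variant can be sketched as follows in Python with NumPy (a didactic sketch under the assumptions above; function names, initialization, and block sizes are illustrative): each block is clustered for a few iterations, and the resulting centers initialize the clustering of the next block.

```python
import numpy as np

def kmeans_block(data, centers, iterations):
    """Run a few k-means iterations on one data block, starting from the given centers."""
    for _ in range(iterations):
        # assign each sample to its nearest center
        distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # recompute each center as the centroid of its cluster
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return centers

def incremental_kmeans(blocks, k, iterations=5, seed=0):
    """Cluster a stream of data blocks; the centers from block i initialize block i + 1."""
    rng = np.random.default_rng(seed)
    centers = None
    for block in blocks:
        if centers is None:
            centers = block[rng.choice(len(block), k, replace=False)].astype(float)
        centers = kmeans_block(block, centers, iterations)
    return centers

# Example: two blocks of 2-D points arriving sequentially.
blocks = [np.random.default_rng(1).normal(0, 1, (100, 2)),
          np.random.default_rng(2).normal(3, 1, (100, 2))]
print(incremental_kmeans(blocks, k=2))
```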
Reinforcement learning
Reinforcement Learning (RL) is based on an agent that interacts with the environment to learn an optimal policy by trial and error, for sequential decision-making problems in the natural and social sciences and in engineering [90]. The reinforcement learning agent interacts with the environment over time. At each time step t, the agent applies a policy π(at|st), where st is the state the agent receives from a state space S and at is an action selected by the agent from an action space A. The agent maps the state st into an action at, receives a scalar reward, or penalty, rt, and transitions to the next state st+1. The transition occurs according to the dynamics, or model, of the environment. The function R(s,a) models the agent’s reward and the function P(st+1|st,at) models the probability of transition between states. The agent continues the execution until it reaches a terminal state, when the cycle is restarted. The return \(R_{t} = \sum ^{\infty }_{k=0}\gamma ^{k} r_{t+k}\) is the accumulated reward discounted by a factor γ∈[0,1). The central idea of reinforcement learning is that the agent maximizes the expected long-term return for each state. It is worth mentioning that reinforcement learning mechanisms assume that the problem satisfies the Markov property, in which the future depends only on the current state and action, and not on the past. Thus, the problem is formulated as a Markov Decision Process (MDP), defined by the 5-tuple (S,A,P,R,γ).
Temporal difference learning is one of the pillars of RL, as it underlies the methods for learning the value function, such as State-Action-Reward-State-Action (SARSA) and Q-learning. Temporal difference learning estimates the value function V(s) directly from experience using the temporal difference error, in a model-free, online, and fully incremental way. Temporal difference learning is a prediction problem. The update rule is V(s)←V(s)+α[r+γV(s′)−V(s)], where α is the learning rate and r+γV(s′)−V(s) is the so-called temporal difference error. The use of the generalized gradient descent algorithm guarantees the convergence of temporal difference learning [79].
The problem of policy prediction, or evaluation, with reinforcement learning consists of calculating the state- or action-value function for a policy. The control problem consists of finding the optimal policy. The SARSA algorithm evaluates the policy based on samples from the same policy and refines the policy greedily with respect to the action values. In off-policy methods, the agent learns a value function or an optimal policy, possibly while following an unrelated behavior policy. The Q-learning algorithm, for example, tries to directly find the action values of the optimal policy, not necessarily fitting the policy that generates the data. Thus, the policy found by Q-learning is generally different from the policy that generates the samples [90]. The notions of on-policy and off-policy methods relate to evaluating the solutions found with the same policy that generates the data or evaluating them with a different policy. In model-based methods, a model of the environment is used to generate data, while in model-free methods the agent learns through trial and error from the actions taken. Reinforcement learning algorithms can run in online or offline mode. In online mode, the training algorithms are executed on data acquired as sequential streams. In offline, or batch, mode, models are trained on the entire dataset.
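As an illustration of off-policy temporal difference control, the following Python sketch implements a tabular Q-learning update with an ε-greedy behavior policy; the environment interface (env.reset(), env.step(), env.actions) and all parameter values are hypothetical.

```python
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning: learn action values for the optimal policy by temporal difference."""
    Q = defaultdict(float)   # Q[(state, action)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy over the action space
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # off-policy target: greedy value of the next state (zero at terminal states)
            target = r if done else r + gamma * max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # temporal difference error update
            s = s_next
    return Q
```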
Reinforcement learning is present in works that perform knowledge extraction and decision making in wireless networks [91–94]. Liu and Yoo propose a scheme that uses the Q-learning algorithm to dynamically allocate blank subframes so that LTE-U and Wi-Fi systems can successfully coexist [91]. The adjustment technique is based on reinforcement learning. The proposal introduces a new LTE-U frame structure which, in addition to allocating blank subframes, also reduces the LTE delay. In the proposal, the actions of the algorithm are modeled as a tuple that contains the total number of subframes in a block of frames and the portion of subframes for LTE-U. The states are modeled as the set of all actions taken up to that state. Tabrizi et al. argue that next-generation communications tend to use an integrated network system, in which Wi-Fi access points and cellular base stations work together to maximize the QoS of the mobile user [92]. Thus, the authors propose that mobile devices with different access technologies easily switch from one access point or base station to another to improve user performance. The proposal is based on network selection to maximize QoS, formulated as an MDP. Therefore, they use an algorithm based on reinforcement learning designed to select the best network based on the current network load and also on possible future network states. Chen and Qiu propose an approach for cooperative spectrum sensing using Software-Defined Radio (SDR) based on the Q-learning algorithm [93]. In spectrum sensing using Q-learning, the state set consists of all combinations of a bit vector in which each user sets a bit, and the action set is {0, 1}, where action “0” means that the channel is “available” for secondary users, and action “1” means that the channel is “busy”, unavailable for secondary users. In the proposed algorithm, rewards are positive when the action agrees with the channel occupancy and negative otherwise. Besides, Santos Filho et al. introduce a bandwidth control mechanism for cloud providers based on the Q-learning algorithm [94]. The mechanism constantly adapts parameters of the cloud network infrastructure to meet the clients’ Service Level Agreement, while the cloud provider maximizes the network occupancy and, thus, the provider’s revenue.
Deep learning
Deep Learning (DL) is a class of machine learning techniques that exploits multiple layers of nonlinear information processing to transform original data and extract higher-level information from it [95]. It may be either supervised or unsupervised. In particular, DL lies at the intersection of the research areas of neural networks, artificial intelligence, graphical modeling, optimization, pattern recognition, and signal processing. DL is generally used in image, sound, and text processing [96].
The key idea of DL is to generate new representations of data for each layer, increasing the degree of abstraction of data representation. The increasing popularity of DL techniques occurs due to the accelerated increase in the processing capacity of chipsets, such as Graphic Processing Unit (GPU); the significant increase in the data available for training models; and recent advances in machine learning research [96], which has allowed the exploration of complex non-linear composition functions, distributed and hierarchical learning, and the effective use of labeled and non-labeled data.
DL architectures and techniques are used for data synthesis/generation or recognition/classification and, therefore, are generally classified into [97]: deep generative architectures, which characterize the high-order correlation properties of the observed data for pattern analysis or synthesis, and/or the joint statistical distributions of the observed data and their associated classes; deep discriminative architectures, which provide values to discriminate data into pattern classes and, sometimes, characterize the posterior distributions of the classes conditioned on the observed data; and deep hybrid architectures, which discriminate data into classes assisted by the results of generative architectures through optimization and/or regularization, when discriminative criteria are used to learn the parameters of generative models. Generative architectures are associated with the identification and recognition of hidden patterns in observed data, while discriminative architectures are associated with the classification of observed data into defined classes. It is noteworthy that generative architectures are related to unsupervised learning problems, while discriminative architectures are related to supervised learning problems. In the unsupervised learning process, there is no labeled data and, therefore, the main goal is to generate labeled data using unsupervised learning algorithms, such as the Restricted Boltzmann Machine (RBM), the Deep Belief Network (DBN), the Deep Neural Network (DNN), generalized AutoEncoders, and the Recurrent Neural Network (RNN) [96].
RBMs are generative probabilistic models capable of automatically extracting features from the input data using a completely unsupervised learning algorithm [98]. RBMs consist of a hidden layer and a visible layer of neurons, with the connections between hidden and visible neurons represented by a matrix of weights. To train an RBM, samples from a training dataset are used as input to the RBM via the visible neurons, and the network generates samples by alternating back and forth between the visible and hidden neurons. The purpose of the training is to learn the weights of the connections between visible and hidden neurons, as well as the neuron activation biases, so that the RBM learns to reconstruct the input data during the phase in which the visible neurons are sampled from the hidden neurons. Each sampling step is essentially a matrix multiplication between a batch of training samples and the weight matrix, followed by a neuron activation function, which in many cases is the sigmoid 1/(1+e−x). Sampling between the hidden and visible layers is followed by a slight change in the parameters, defined by the learning rate α, and repeated for each batch of data in the training dataset, as many times as necessary to achieve convergence [98].
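A minimal NumPy sketch of one training step following the sampling description above is given below; it uses the common contrastive-divergence (CD-1) approximation, and the matrix shapes, learning rate α, and names are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # neuron activation function

def cd1_step(W, b_visible, b_hidden, batch, alpha=0.01, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) update of the RBM weights on a batch of samples."""
    # positive phase: sample the hidden neurons from the visible data
    h_prob = sigmoid(batch @ W + b_hidden)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # negative phase: reconstruct the visible neurons, then recompute the hidden activations
    v_prob = sigmoid(h_sample @ W.T + b_visible)
    h_prob_recon = sigmoid(v_prob @ W + b_hidden)
    # move the parameters slightly towards reconstructing the input (learning rate alpha)
    W += alpha * (batch.T @ h_prob - v_prob.T @ h_prob_recon) / len(batch)
    b_visible += alpha * (batch - v_prob).mean(axis=0)
    b_hidden += alpha * (h_prob - h_prob_recon).mean(axis=0)
    return W, b_visible, b_hidden
```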
The DBN consists of multiple layers of stochastic hidden variables and is related to the RBM, as it consists of the composition and stacking of several RBMs. The composition of multiple RBMs allows many hidden layers to be trained efficiently, using the activations of one RBM as the training data for the next stage [96, 98].
Convolutional Neural Networks (CNNs) represent one of the most common forms of DNN [99]. CNNs have multiple convolutional layers, and each layer generates a feature map, a higher-level abstraction of the input data that preserves essential and unique information. Each of the convolutional layers in a CNN mostly performs high-dimensional convolutions, in which the input activations are structured as a set of input feature maps, called channels. For this reason, CNNs are generally used in signal processing. Each channel is convolved with a different filter from the filter stack. The result of this computation is the output activations, which form an output feature map channel. Finally, several input feature maps can be processed together as a batch to potentially improve the reuse of the filter weights.
The AutoEncoder has traditionally been used for dimensionality reduction and feature learning. The fundamental idea of the AutoEncoder is the presence of a hidden layer h that encodes the input, and two other main parts: the encoding function h=f(x) and the decoding, or reconstruction, function r=g(h). The encoder and the decoder are trained together, and the discrepancy between the original data and its reconstruction is minimized. The deep AutoEncoder is an unsupervised model [96].
Conventional neural networks are based on the principle that all data points are independent. For this reason, if data points are related in time or space, the chance of losing the state of the network is high. RNNs are based on sequences, so they can model inputs or outputs composed of several mutually dependent elements. RNNs can be used in unsupervised or supervised learning. When used in unsupervised learning, predicting the next data sample from previous data samples is possible, but the network is difficult to train [96].
DL methods are usually trained with the stochastic gradient descent approach [79], in which a training example with a known label is used to update the model parameters. This strategy fits online stream learning. A strategy to speed up learning is to perform updates on mini-batches of data instead of proceeding sequentially with one sample at a time [100]. The samples in each mini-batch should be as independent as possible to provide a good balance between memory consumption and runtime.
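The following Python sketch illustrates the mini-batch strategy on a stream, using a simple linear model trained with mean-squared-error gradients; the model, the synthetic data, and all names are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def minibatch_sgd(stream, dim, batch_size=32, lr=0.01):
    """Train a linear model on a stream of (x, y) pairs using mini-batch gradient steps."""
    w = np.zeros(dim)
    batch_x, batch_y = [], []
    for x, y in stream:
        batch_x.append(x)
        batch_y.append(y)
        if len(batch_x) == batch_size:                  # accumulate a mini-batch, then update once
            X, Y = np.array(batch_x), np.array(batch_y)
            grad = 2 * X.T @ (X @ w - Y) / batch_size   # gradient of the mean squared error
            w -= lr * grad
            batch_x, batch_y = [], []                   # discard the mini-batch after the update
    return w

# Example: learn y = 3*x1 - 2*x2 from a synthetic stream of 10,000 samples.
rng = np.random.default_rng(0)
stream = ((x, x @ np.array([3.0, -2.0])) for x in rng.normal(size=(10_000, 2)))
print(minibatch_sgd(stream, dim=2))
```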
Recent works use DL to infer the characteristics and behavior of wireless networks. Wang et al. developed an algorithm based on DL to explore bi-modal data [101]. The goal is to estimate the phase angle of arrival and average amplitudes on the 5 GHz radio interface of wireless networks. Their algorithm generates indoor location fingerprints of devices. DL produces feature-based fingerprints from bi-modal data in the offline training stage. The weights of the deep AutoEncoder network are the feature-based fingerprints for each position. Wang et al. compare two indoor location approaches using DL, one with an AutoEncoder and the other with Convolutional Neural Networks [102]. The authors conclude that the approach based on the AutoEncoder presents a smaller error in the inference of the indoor position. Turgut et al., in turn, use a deep AutoEncoder to perform the indoor localization of devices, considering as initial features the signal strength received from 26 access points [103]. Wang et al. use DL to perform more accurate and robust activity recognition over wireless channels. The central idea is to actively select available Wi-Fi channels with good quality and switch between adjacent channels to form an extended channel. The authors search for sequential patterns of channel usage and then adopt a Recursive Neural Network model [104].
GPUs improve performance in data-intensive tasks through parallel computing. GPUs allow running a large number of threads in parallel, which makes them attractive for computationally intensive state-space exploration tasks such as DL [105]. OpenCL [106] and the Compute Unified Device Architecture (CUDA) [107] are the main Application Programming Interfaces (API) to program and manage GPUs. The CUDA platform, developed by Nvidia, and OpenCL give access to the GPU’s virtual instruction set to run parallel tasks. The CUDA platform works with programming languages such as C, C++, and Fortran. Some extensions currently work with dynamically typed languages, such as Python, through PyCUDA, PyOpenCL [108], or Numba [109]. A single GPU system computes a throughput of approximately 15 TFlops. Nevertheless, current solutions create clusters of GPUs, achieving up to a 2.6-times speedup on a 4-GPU cluster [110] and improving Distributed Deep Learning (DDL) task performance up to 5.5 times on a 100-GPU cluster [111].
Table 4 presents a qualitative comparison of the main techniques to process large streaming data discussed in this paper.
Table 4 Comparison between different knowledge extraction techniques. Stream processing applies different techniques to extract knowledge from arriving data. The techniques that best fit the envisioned mining scenario depend on the available data, on the training model, and on the expected goals of the processing