Deep convolutional forest: a dynamic deep ensemble approach for spam detection in text

The increase in people’s use of mobile messaging services has led to the spread of social engineering attacks like phishing, considering that spam text is one of the main factors in the dissemination of phishing attacks to steal sensitive data such as credit cards and passwords. In addition, rumors and incorrect medical information regarding the COVID-19 pandemic are widely shared on social media leading to people’s fear and confusion. Thus, filtering spam content is vital to reduce risks and threats. Previous studies relied on machine learning and deep learning approaches for spam classification, but these approaches have two limitations. Machine learning models require manual feature engineering, whereas deep neural networks require a high computational cost. This paper introduces a dynamic deep ensemble model for spam detection that adjusts its complexity and extracts features automatically. The proposed model utilizes convolutional and pooling layers for feature extraction along with base classifiers such as random forests and extremely randomized trees for classifying texts into spam or legitimate ones. Moreover, the model employs ensemble learning procedures like boosting and bagging. As a result, the model achieved high precision, recall, f1-score and accuracy of 98.38%.


Introduction
Mobile messaging services have become one of the most common means of communication, since they allow individuals to communicate with one another at any time and from any location. Besides, a vast number of messaging apps provide their service for free. According to statistics, 95% of mobile messages in the USA are read and responded to within three minutes of receipt [1]. In addition, Short Message Service (SMS) offers businesses an enormous chance to communicate and interact with customers, as 48% of consumers prefer direct communication from businesses via SMS [1]. As a result, users are prone to SMS attacks such as spam and phishing, especially users who lack awareness about cyber threats.
Corresponding author: Mai A. Shaaban, mai.shaaban@alexu.edu.eg
Spam text is any undesired text transmitted to people without their permission and may include a phone number to call, a link to open a website, or a link to download a file. Thus, an attacker can masquerade as a trusted entity and exploit spam texts by attaching malicious links, so that victims may be duped into clicking a harmful link, resulting in installing malware or revealing sensitive information, including login credentials and credit card numbers [2,3]. For example, phishing attacks can occur by sending fake messages to users telling them to reset their passwords in order to log in to Facebook, Twitter, or any other platform [4]. Besides, spammers can share misleading information about the COVID-19 pandemic, causing a negative impact on society [5]. Therefore, filtering spam texts is crucial to protect users against social engineering attacks, mobile malware and threats.
Previous studies in the area of spam classification focused on using machine learning algorithms [6], but these algorithms require prior knowledge and domain expertise for identifying useful features in order to achieve accurate classification [7]. Furthermore, researchers proposed deep learning approaches to detect spam [5]. However, deep neural networks require much effort in tuning hyper-parameters [8]. Besides, they require massive data for training to predict new data accurately. Consequently, they require a high computational cost [8].
To overcome the high complexity of deep learning models and to reduce the effort spent in tuning hyper-parameters, Zhou and Feng [8] developed multi-grained cascade forest (gcForest), a decision tree ensemble approach that can be applied to different classification tasks and has much fewer hyper-parameters than deep learning neural networks. Ensemble methods [9] train multiple base models to produce a single best-fit predictive model. Kontschieder et al. [10] demonstrated that employing ensemble approaches like random forests [11] aided by deep learning model features can be more effective than solely using a deep neural network. gcForest applies multi-grained scanning for extracting features and employs a cascade structure (i.e., layer-by-layer processing of raw features) for learning. Inspired by gcForest, this paper enhances the procedure of feature engineering by replacing the multi-grained scanning with convolutional layers and pooling layers to capture high-level features from textual data. The motivation for using gcForest as a baseline for this paper is that gcForest is the first deep learning model that trains data without relying on neural networks and backpropagation, as the authors claimed [8].
This paper introduces a dynamic (self-adaptive) deep ensemble mechanism to overcome the stated limitations of machine learning and deep learning approaches for detecting spam texts. The main contributions of this paper are as follows:
- Implementing a dynamic deep ensemble model called Deep Convolutional Forest (DCF).
- Extracting features automatically by utilizing convolutional layers and pooling layers.
- Determining the model complexity automatically so that the model can perform accurately on both small-scale data and large-scale data.
- Classifying text into Spam or Ham (Not-Spam), achieving a remarkable accuracy.
The rest of this paper is arranged as follows: "Related work" provides the literature review, "Methodology" explains the word embedding technique, followed by the detailed explanation of Deep Convolutional Forest (DCF), "Experimental results" shows the results of the proposed method along with results of traditional machine learning classifiers and existing deep learning methods, "Discussion" discusses the findings and explains why the proposed solution outperforms the existing solutions. Finally, "Conclusion" concludes the proposed work and contains recommendations for future research.

Related work
Over recent years, computer scientists have published a considerable volume of literature on spam detection [5,6,12,13]; these works were limited to using machine learning and deep learning based models.
Bassiouni et al. [14] experimented with multiple classifiers to filter emails gathered from the Spambase UCI dataset, which contained 4601 instances. They performed data preprocessing; then they selected features using Infinite Latent Feature Selection (ILFS). Finally, they classified emails with an accuracy of 95.45% using Random Forest (RF), while the following remaining classifiers: Artificial Neural Network (ANN), Logistic Regression, Support Vector Machine (SVM), Random Tree, K-Nearest Neighbors (KNN), Decision Table (…
Merugu et al. [15] classified text messages into Spam and Ham categories with an accuracy rate of 97.6% using Naive Bayes, which outperformed other machine learning algorithms such as Random Forest, Support Vector Machine and K-Nearest Neighbors according to the experimental results. The messages were collected from the UCI repository, which contained 5574 variable-length messages. To feed data into a classification model, the authors converted messages into fixed-length numerical vectors by creating term frequency-inverse document frequency (TF-IDF) [16] vectors using the bag of words (BoW) model.
In 2020, Gaurav et al. [17] proposed a spam mail detection (SPMD) method based on the document labeling concept, which sorts new messages into two categories: Ham and Spam. Experimental results illustrated that Random Forest produced the highest accuracy of 92.97% among the following three classification models: Naive Bayes, Decision Tree and Random Forest. Lately, researchers have proposed deep learning methods such as Convolutional Neural Network (CNN) [18] and Long Short-Term Memory (LSTM) [19,20] for categorizing Spam and Ham messages. Popovac et al. [18] applied a CNN-based architecture after performing data preprocessing steps including tokenization, stemming, preservation of the sentiment of text and removal of stop words. The feature extraction process involved transforming a text into a matrix of TF-IDF [16] features. According to their experiment, CNN proved to be more effective than machine learning algorithms, with an accuracy score of 98.4%. Another spam filtering model was proposed in [19]; it combined N-gram TF-IDF feature selection, a modified distribution-based balancing algorithm and a regularized deep multi-layer perceptron neural network model with rectified linear units (DBB-RDNN-ReL). Although the model was computationally intensive, it provided an accuracy of 98.51%.
Jain et al. [20] used sequential stacked CNN-LSTM model (SSCL) to classify SMS spam with an accuracy of 99.01%. They converted each text into semantic word vectors with the help of Word2vec, WordNet and ConceptNet. However, searching for the word vectors in these embeddings caused system overload.
Ghourabi et al. [21] presented a hybrid model for classifying text messages written in Arabic or English that is based on the combination of two deep neural network models: CNN and LSTM. The results indicated that the CNN-LSTM model scored an accuracy of 98.37%, which is higher than other techniques like Support Vector Machine, K-Nearest Neighbors, Multinomial Naive Bayes, Decision Tree, Logistic Regression, Random Forest, AdaBoost, Bagging classifier and Extra Trees.
In 2020, Roy et al. [7] focused on how to effectively filter Spam and Not-Spam text message downloaded from the UCI Repository [22], which contains 5574 instances. They tested deep learning algorithms such as Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) as well as machine learning classifiers such as Naive Bayes (NB), Random Forest (RF), Gradient Boosting (GB), Logistic Regression (LR) and Stochastic Gradient Descent (SGD). The experimental results confirmed that applying CNN with three convolutional layers and a regularization parameter (dropout) on randomly sampled tenfold cross validation data resulted in an accuracy of 99.44%. However, the authors spent much effort in tuning hyper-parameters.
As mentioned in "Introduction", this study aims to handle feature relationships in textual data by using convolutional layers together with pooling layers as alternatives to the multi-grained scanning procedure proposed by [8]. Zhou and Feng [8] proposed an ensemble of ensembles mechanism, meaning that each level learns from its previous level and each level has an ensemble of decision-tree forests.
Descriptions of existing text-based spam detection techniques with respect to datasets, feature extraction and selection methods, types of algorithms, and performance measure are covered in Table 1.

Methodology
Data pass through two main phases as shown in Fig. 1: the first phase applies the word embedding technique after preprocessing to convert textual data into a numerical form, and the second phase uses Deep Convolutional Forest (DCF) to extract features and classify text as illustrated in Fig. 2. The proposed method analyses the SMS spam dataset, which is publicly available [22]. The dataset is a collection of messages where each message is labeled either 1 (Spam) or 0 (Not-Spam). First, text messages were prepared by splitting each message into a list of words and then performing text preprocessing techniques like stemming and stop words removal. Afterward, each word was transformed into a sequence of numbers called a word vector; the word vector was generated using the word embedding technique as explained in "Word embedding". Finally, the generated word vectors were sent as a word matrix to DCF, which determines whether a message is Spam or Not-Spam as explained in "Deep convolutional forest".

Word embedding
Word embedding is a technique where each word is represented by a vector of numbers indicating its semantic similarity to other words in the text corpus (i.e., similar words have similar representations). The one-hot encoding method splits each text into a group of words and turns each word into a sequence of numbers, disregarding the word meaning within context [25]. In contrast, the word embedding technique transforms each word into a dense vector, called a word vector, that captures its relative meaning within the document [26]. Here, the word vectors come from the GloVe algorithm [27] and are supplied through the embedding layer [28].
In the word embedding technique, every message $m$ is a sequence of words $w_1, w_2, w_3, \ldots, w_n$; each word is represented as a word vector of length $d$. After that, all word vectors of a given message (i.e., $n$ word vectors) are concatenated to form a word matrix $M \in \mathbb{R}^{n \times d}$. Finally, DCF receives the word matrix through the input layer and performs the convolution operation to produce feature maps through the convolutional layer.
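The construction of the word matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_EMBEDDINGS` is a hypothetical 3-dimensional lookup table standing in for the pre-trained GloVe vectors, `build_word_matrix` is a hypothetical helper name, and mapping unknown words to a zero vector is an assumption that the text does not specify.

```python
from typing import Dict, List

# Hypothetical 3-dimensional embedding table (illustration only);
# the paper relies on pre-trained 100-dimensional GloVe vectors.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "free": [0.9, 0.1, 0.0],
    "prize": [0.8, 0.2, 0.1],
    "call": [0.1, 0.7, 0.3],
    "now": [0.2, 0.6, 0.4],
}


def build_word_matrix(tokens: List[str],
                      embeddings: Dict[str, List[float]],
                      dim: int) -> List[List[float]]:
    """Stack one d-dimensional word vector per token into an n x d matrix."""
    zero = [0.0] * dim  # assumption: unknown words map to the zero vector
    return [list(embeddings.get(tok, zero)) for tok in tokens]


message = ["free", "prize", "call", "now"]  # n = 4 words
M = build_word_matrix(message, TOY_EMBEDDINGS, dim=3)
print(len(M), len(M[0]))  # 4 3  (n rows, d columns)
```

Each row of `M` corresponds to one word of the message, so the row order preserves the word order that the convolution later slides over.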

Deep convolutional forest
The deep convolutional forest (DCF) model employs a cascade approach inspired by the structure of deep neural networks and deep forest [8] as shown in Fig. 2. Each level receives processed feature information from its prior level and outputs its processing outcome to the posterior level. The output of each level is the probabilities of both classes: Spam and Not-Spam; these probabilities are then concatenated with the feature maps to form the input of the next level. The model predicts the class of a given message by taking the average of the probabilities of Spam and the probabilities of Not-Spam separately from the last level output, and then it takes the maximum average as a final prediction result.
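The averaging rule for the final prediction can be illustrated as follows. The sketch assumes that each forest of the last level returns a pair of probabilities ordered as (Not-Spam, Spam); the function name `final_prediction`, the sample probabilities and the tie-breaking choice are hypothetical.

```python
from typing import List, Tuple


def final_prediction(forest_probs: List[Tuple[float, float]]) -> str:
    """Average the class probabilities over all forests of the last level.

    Each tuple holds (P(Not-Spam), P(Spam)) produced by one forest.
    """
    n = len(forest_probs)
    avg_ham = sum(p[0] for p in forest_probs) / n
    avg_spam = sum(p[1] for p in forest_probs) / n
    # Assumption: a tie is resolved in favour of Not-Spam.
    return "Spam" if avg_spam > avg_ham else "Not-Spam"


# Made-up outputs of four forests for a single message.
last_level = [(0.2, 0.8), (0.4, 0.6), (0.3, 0.7), (0.55, 0.45)]
print(final_prediction(last_level))  # Spam
```

In this toy case one forest votes for Not-Spam, but the averaged Spam probability (0.6375) still exceeds the averaged Not-Spam probability (0.3625).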
The model accuracy is the main factor that determines the number of levels. Whenever the accuracy of validation data increases, DCF continues to generate new levels and stops when there is no significant improvement in the accuracy score, unlike deep neural networks where the number of hidden layers is a pre-defined parameter. As a result, DCF is applicable to different scales of datasets, not limited to large-scale ones, as it automatically adjusts its complexity by terminating the training process when adequate. The core units of each level in DCF are the convolutional layer, the pooling layer and the classification layer, which contains four base classifiers: two random forests [11] and two extremely randomized trees [29]. The convolutional layer is responsible for the feature extraction task, whereas the pooling layer helps reduce overfitting in the proposed model. Moreover, the classification layer predicts the probabilities of Spam and Not-Spam for a given message.
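A minimal sketch of this stopping rule is given below. It is not the authors' implementation: `accuracy_at_level` is a hypothetical callable that would train one level and return its validation accuracy, and the threshold `min_gain`, the cap `max_levels` and the decision to discard the non-improving level are all assumptions made for illustration.

```python
from typing import Callable, List


def grow_levels(accuracy_at_level: Callable[[int], float],
                max_levels: int = 10,
                min_gain: float = 1e-3) -> List[float]:
    """Keep adding levels while validation accuracy improves by more than min_gain."""
    history: List[float] = []
    best = float("-inf")
    for level in range(1, max_levels + 1):
        acc = accuracy_at_level(level)  # train this level, score on validation data
        if history and acc - best <= min_gain:
            break  # no significant improvement: drop this level and stop
        history.append(acc)
        best = acc
    return history


# Made-up validation accuracies per level, for illustration only.
toy_scores = {1: 0.970, 2: 0.983, 3: 0.9832, 4: 0.990}
kept = grow_levels(lambda level: toy_scores[level])
print(len(kept))  # 2
```

With these toy scores the third level gains only 0.0002, so training stops after two levels and the fourth level is never built.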
DCF combines the advantages of two techniques, bagging and boosting, which work together to decrease variance and bias [30,31]. Bagging refers to a group of weak learners that learn independently of each other in parallel and combine their outcomes to determine the model average [30], whereas boosting is a group of weak learners that learn in series, where the next learner tries to improve the results of the previous learner [30]. DCF represents bagging by combining outputs from each forest in the classification layer as well as by using Random Forest as a base classifier, which combines predictions from each decision tree and outputs the model average. Moreover, DCF supports boosting since it keeps adding levels where the next level tries to correct errors present in the previous level.

Convolution operation
The convolutional layer extracts hidden features from the textual data by performing the convolution operation on a word matrix and applying the Rectified Linear Unit (ReLU) activation function [32] to the output. Let $M \in \mathbb{R}^{n \times d}$ be the input word matrix holding $n$ words, each represented by a $d$-dimensional word vector. A filter $F \in \mathbb{R}^{d \times k}$ slides over the input, resulting in a feature vector $O$ of dimension $n - k + 1$, also known as a feature map, where $k$ is the region size of the kernel. The process of finding the feature vector assuming that $k = 2$ is shown in (1). Let

$$M = \begin{bmatrix} m_{1,1} & \cdots & m_{1,d} \\ \vdots & \ddots & \vdots \\ m_{n,1} & \cdots & m_{n,d} \end{bmatrix}, \qquad F = \begin{bmatrix} f_{1,1} & f_{1,2} \\ \vdots & \vdots \\ f_{d,1} & f_{d,2} \end{bmatrix}$$

Then

$$O = M \ast F \qquad (1)$$

where $\ast$ is the convolution operator, which results in a feature vector $O$ of length $(n - 2 + 1)$ in which each feature is calculated as follows:

$$O_i = \sum_{j=1}^{d} \left( m_{i,j} f_{j,1} + m_{i+1,j} f_{j,2} \right), \qquad i = 1, \ldots, n - 1$$

The feature vector $O$ passes through the Rectified Linear Unit (ReLU) activation function. As shown in (2), ReLU takes each value $O_i$ and returns $O_i$ if the value is positive; otherwise, it returns zero, meaning that ReLU finds the maximum value between $O_i$ and zero. This value is denoted by $\hat{O}_i$.

$$\hat{O}_i = \max(0, O_i) \qquad (2)$$

The output of the convolutional layer after applying several filters becomes a set of feature maps, as each filter produces one feature map $\hat{O}$ of length $(n - 2 + 1)$ having non-negative values only.
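To make the arithmetic concrete, the following sketch computes one feature map for a toy word matrix with n = 3, d = 2 and a filter of region size k = 2, then applies ReLU. The helper names `conv1d_text` and `relu` and all numeric values are hypothetical; the filter is indexed as a d x k matrix to match the notation above, and no bias term is assumed.

```python
from typing import List

Matrix = List[List[float]]


def conv1d_text(M: Matrix, F: Matrix) -> List[float]:
    """Slide a d x k filter over an n x d word matrix (stride 1, no padding)."""
    n, d, k = len(M), len(F), len(F[0])
    out = []
    for i in range(n - k + 1):          # one output per window of k words
        s = 0.0
        for r in range(k):              # position inside the window
            for j in range(d):          # embedding dimension
                s += M[i + r][j] * F[j][r]
        out.append(s)
    return out


def relu(values: List[float]) -> List[float]:
    """Replace every negative value by zero."""
    return [max(0.0, v) for v in values]


M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 words, d = 2
F = [[1.0, -1.0], [-2.0, 0.5]]            # d = 2, k = 2
O = conv1d_text(M, F)
print(O)        # [1.5, -2.5]
print(relu(O))  # [1.5, 0.0]
```

The feature map has n - k + 1 = 2 entries, and ReLU zeroes the negative response of the second window.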

Pooling operation
The pooling layer in DCF applies the pooling operation to downsample feature maps and avoid overfitting [33]. Pooling aggregates the output of each filter by pulling a small set of features out of a large one, which reduces the amount of computation required for processing the next level. Hence, pooling should reduce overfitting, which arises from high model complexity and causes misclassification of unseen data as the model learns the noise in textual data [20,21,25]. In addition, DCF supports the early stopping procedure, meaning that the model stops training when the performance starts to degrade. Pooling has three common variations [7]; max-pooling yielded better results than min-pooling and average pooling. Max-pooling keeps the maximum value of each feature map, as shown in (3).

$$\tilde{O} = \max\left(\hat{O}_1, \hat{O}_2, \ldots, \hat{O}_{n-k+1}\right) \qquad (3)$$

In the end, the features extracted using the convolution and the pooling layer (i.e., $\tilde{O}$) pass through the classification layer. Algorithm 1 provides the detailed steps of the feature extraction process, considering that the number of features is equal to the number of filters $L$.

Algorithm 1 Feature extraction
1: function FeatureExtraction(M, filters)
2:   for j = 1 to L do
3:     convolve M with the j-th filter as in (1)
4:     apply ReLU to the resulting feature map as in (2)
5:     apply max-pooling to obtain the j-th feature as in (3)
6:   end for
7:   return $\tilde{O}$
8: end function
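The feature extraction steps described above (convolution, ReLU, then max-pooling per filter) can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes the pool covers the whole feature map, so that each filter contributes exactly one feature, and the names `feature_map` and `extract_features` as well as the toy values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def feature_map(M: Matrix, F: Matrix) -> List[float]:
    """Convolve an n x d word matrix with a d x k filter, then apply ReLU."""
    n, d, k = len(M), len(F), len(F[0])
    return [
        max(0.0, sum(M[i + r][j] * F[j][r] for r in range(k) for j in range(d)))
        for i in range(n - k + 1)
    ]


def extract_features(M: Matrix, filters: List[Matrix]) -> List[float]:
    """Return one feature per filter: the maximum of that filter's feature map."""
    return [max(feature_map(M, F)) for F in filters]


M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy word matrix, n = 3, d = 2
filters = [
    [[1.0, -1.0], [-2.0, 0.5]],           # filter 1 (d = 2, k = 2)
    [[0.0, 1.0], [1.0, 0.0]],             # filter 2
]
features = extract_features(M, filters)
print(features)  # [1.5, 2.0]
```

The output length equals the number of filters (two here, 64 in the reported configuration), independent of the message length n.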

Random forests and extremely randomized trees
The base classifiers in the proposed model are random forests [11] and extremely randomized trees [29], which rely on the decision tree ensemble approach. Ensemble methods improve prediction results by combining multiple classifiers for the same task [9]. Each level in DCF involves different forest types to support diversity; diversity is crucial in constructing ensemble methods [9].
Both forest types, random forests [11] and extremely randomized trees [29], consist of a vast number of decision trees, where the prediction of every tree participates in the final decision of a forest by taking the majority vote. Furthermore, the tree-growing procedure is the same in both techniques, as they select a subset of features randomly, but they differ in two respects, as explained below.
Random forests build a decision tree by subsampling the input and selecting a subset of features randomly; they then choose the optimum feature for the split at each tree node, namely the one with the best Gini value [31]. Extremely randomized trees, in contrast, use the whole input and choose a random feature for the split at each node.
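The difference in split selection can be illustrated with a small Gini computation. The sketch is deliberately simplified and rests on assumptions: a single fixed threshold of 0.5 is used for every feature, the data are made up, and the function names are hypothetical. Real forest implementations search over thresholds as well.

```python
import random
from collections import Counter
from typing import List, Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a set of class labels (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(rows: List[List[float]], labels: List[int],
               feature: int, threshold: float) -> float:
    """Weighted Gini impurity after splitting on rows[feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_feature(rows, labels, candidates, threshold=0.5) -> int:
    """Random-forest style: pick the candidate with the lowest weighted Gini."""
    return min(candidates, key=lambda f: split_gini(rows, labels, f, threshold))


def random_feature(candidates: List[int], rng: random.Random) -> int:
    """Extremely-randomized style, as described above: pick a candidate at random."""
    return rng.choice(candidates)


rows = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.3], [0.2, 0.8]]  # toy feature vectors
labels = [1, 1, 0, 0]                                    # 1 = Spam, 0 = Not-Spam
print(best_feature(rows, labels, [0, 1]))                # 0
```

Feature 0 separates the two classes perfectly (weighted Gini 0), whereas feature 1 leaves both sides mixed (weighted Gini 0.5), so the Gini-based choice is feature 0 while the random choice may return either.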

Final prediction
As shown in Fig. 2, the DCF first level extracts features from a text through the convolutional layer and the max-pooling layer. Next, each forest in the first level classification layer takes the extracted features and outputs two probabilities: the probability of a given message being Spam and being Not-Spam. After that, DCF computes the accuracy of the first level to be compared with the new accuracy of the second level. The second level in DCF produces new features, which are then concatenated with the probabilities generated by the first level to form the input to the second level classification layer, which outputs new probabilities (predictions). A new accuracy is calculated and compared with the previous accuracy so that DCF will continue to generate levels until it finds no significant improvement in accuracy or reaches the maximum number of levels. Each level has a convolutional layer, a max-pooling layer and a classification layer consisting of four forests: two random forests [11] and two extremely randomized trees [29]. Hence, each level in DCF outputs eight probabilities (i.e., four probabilities for each class).
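The way a level's input is assembled can be sketched as a simple concatenation. The example assumes that each of the four forests returns a (Not-Spam, Spam) probability pair; the helper name `next_level_input` and the numbers are hypothetical, and only three features are used instead of the 64 produced in the reported configuration.

```python
from typing import List, Sequence, Tuple


def next_level_input(features: Sequence[float],
                     forest_probs: Sequence[Tuple[float, float]]) -> List[float]:
    """Append the class probabilities of every forest to the feature vector."""
    flat = [p for probs in forest_probs for p in probs]
    return list(features) + flat


features = [1.5, 2.0, 0.0]  # features produced by the current level (toy values)
previous_probs = [(0.2, 0.8), (0.4, 0.6), (0.3, 0.7), (0.55, 0.45)]
x_next = next_level_input(features, previous_probs)
print(len(x_next))  # 11  (3 features + 8 probabilities)
```

With four forests and two classes, every level after the first receives eight extra inputs on top of its own extracted features.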
The last level in DCF takes the average of probabilities for Spam and the average of probabilities for Not-Spam; the higher average value will be the final prediction. Algorithm 2 provides the full implementation of DCF.

Experimental results
This section analyzes the model performance in detecting Spam messages gathered from the UCI repository [22] and compares the results with multi-grained cascade forest (gcForest), traditional machine learning classifiers and existing deep learning techniques. As shown in Table 2, the number of spam instances is far lower than the number of legitimate ones, so balancing the class distribution is necessary to obtain accurate results. Initially, the dataset was split into two subsets: 80% of the messages for training and the remaining 20% for testing and validation. Then, the SMOTE [34] over-sampling technique was applied to balance the data before feeding it into the classifier.
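The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as below. This is a simplified illustration and not the library implementation used in the experiments: the function name `smote_like`, the parameters `k` and `seed`, and the toy points are hypothetical, and the sketch assumes numeric feature vectors with at least two minority samples.

```python
import random
from typing import List


def smote_like(minority: List[List[float]], n_new: int,
               k: int = 2, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


spam_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy minority-class vectors
new_points = smote_like(spam_points, n_new=5)
print(len(new_points))  # 5
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.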

Configuration
The proposed approach was implemented with Python 3.7 along with TensorFlow [35], Keras [36] and Scikit-learn [37]. The embedding layer in Keras converted textual data into word vectors using GloVe word embeddings, which contain pre-trained 100-dimensional word vectors. The convolutional layer yielded the most promising performance by using 64 filters for applying the convolution operation on each input, where each filter (kernel) is a two-dimensional array of weights that moves one unit at a time (i.e., stride is set to 1). Although many studies used a different number of filters in each convolutional layer in convolutional neural networks [5], DCF uses the same number of filters, as varying it produced no significant change in performance and a single value facilitates the tuning of hyper-parameters. The experiment showed that max-pooling with a pool size equal to the size of the input was better than min-pooling and average-pooling. Moreover, each forest in the classification layer contained 100 trees; using more trees failed to increase the accuracy. All the remaining parameters of the base classifiers were set to default. Consequently, the proposed model predicted Spam messages with an accuracy of 98.38% after generating only two levels. Table 3 summarizes the configuration setup of DCF.

Evaluation metrics
As discussed in "Deep convolutional forest", the accuracy score is the main factor in determining the number of DCF levels. After each level, DCF estimates the performance on the validation set until it finds no significant gain in performance. The experiment showed that two levels were enough to classify Spam messages. As a result, the training procedure was terminated and the model was evaluated on the test set based on the following well-known classification metrics:
- Precision: determines the ability not to label a negative (Not-Spam) message as positive (Spam).
- Recall: determines the ability to find all positive (Spam) messages.
- F1-score: the harmonic mean of precision and recall.
- Accuracy: compares the set of predicted labels to the corresponding set of actual labels.
- Receiver operating characteristic (ROC) curve: plots the True Positive Rate (TPR) on the y-axis as defined in (8) and the False Positive Rate (FPR) on the x-axis as defined in (9). The area under the ROC curve (AUC) measures the model performance; the higher the AUC value, the better the model.

$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad (8)$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (9)$$

where $TP$, $FP$, $TN$ and $FN$ denote the numbers of true positives, false positives, true negatives and false negatives, respectively.
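The listed metrics follow directly from the four cells of a confusion matrix, as the sketch below shows using the standard definitions. The counts are made up for illustration and are not the values of Table 4; the function name `classification_metrics` is hypothetical.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # identical to the true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    fpr = fp / (fp + tn)                 # false positive rate
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "tpr": recall,
        "fpr": fpr,
    }


# Illustrative counts only (Spam is the positive class).
m = classification_metrics(tp=90, fp=10, tn=890, fn=10)
print(round(m["precision"], 3), round(m["accuracy"], 3), round(m["fpr"], 4))
```

Note that the sketch does not guard against empty denominators, which would occur if a classifier never predicted the positive class.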
According to the confusion matrix described in Table 4, the proposed algorithm identified Spam messages with low false-negative and false-positive rates. Moreover, from the data in Table 5, it can be seen that DCF performed well on the test set, resulting in high precision, recall, f1-score, accuracy, and AUC score.
The goal of constructing ensemble models is to minimize the generalization error. As long as the individual learners are diverse and independent, the prediction error of the ensemble model decreases [38]. DCF encourages diversity by employing different structures of forests as base classifiers. Table 6 shows that the results of DCF having four forests of the same type (i.e., four random forests) are indeed worse than having four forests with diverse building strategies as shown in Table 5. Hence, diversity affects the performance of detecting Spam messages. The cross-entropy loss, computed as in (10), reveals whether the model suffers from overfitting, where $y \in \{0, 1\}$ is the true label of a single sample and $p$ is the predicted probability.

$$\mathrm{Loss}(y, p) = -\left( y \log p + (1 - y) \log (1 - p) \right) \qquad (10)$$

In a well-fitted model, the loss should keep decreasing until it stabilizes as the number of levels increases. When using pooling layers during the experiment, the cross-entropy loss decreased from 0.101 to 0.084, while removing pooling layers caused an increase in the loss from 0.131 to 0.182. This indicates that adding pooling layers after convolutional layers enhances the learning performance.
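The per-sample cross-entropy loss with a true label y in {0, 1} and a predicted probability p can be computed as in the following sketch. It uses the standard binary cross-entropy definition; the clipping constant `eps`, the helper names and the sample values are assumptions added for illustration.

```python
import math
from typing import Sequence


def binary_cross_entropy(y: int, p: float, eps: float = 1e-15) -> float:
    """Loss of one sample: -(y*log(p) + (1-y)*log(1-p))."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def mean_loss(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Average cross-entropy over a set of samples."""
    return sum(binary_cross_entropy(y, p) for y, p in zip(labels, probs)) / len(labels)


labels = [1, 0, 1, 0]            # made-up true labels (1 = Spam)
probs = [0.9, 0.1, 0.8, 0.3]     # made-up predicted Spam probabilities
print(round(mean_loss(labels, probs), 4))
```

A confident correct prediction (for example p = 0.9 for a Spam message) contributes a small loss, while a confident wrong prediction is penalized heavily, which is why a rising loss across levels signals overfitting.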

Machine learning classifiers
To apply machine learning classifiers: Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbors (KNN) and Random Forest (RF), text preprocessing techniques such as tokenization, removal of stop words and stemming were applied for extracting features manually from the SMS spam dataset. Table 7 presents 10 features that were extracted after data preprocessing. As indicated in Table 8, DCF outperformed other classifiers in categorizing Spam and Not-Spam messages in terms of precision, recall, f1-score and accuracy. According to the ROC curve in Fig. 3, the AUC score of the proposed model is significantly higher than the other classifiers, considering that the hyper-parameters of the mentioned machine learning algorithms were set to default during the experiment.

Deep learning techniques
Convolutional neural networks (CNN) and long short-term memory (LSTM) were implemented to compare their results with DCF. The number of convolutional layers affects the performance of CNN [7]. Hence, three CNN models were applied: the first model has one convolutional layer (1-CNN), the second model has two convolutional layers (2-CNN) and the third model has three convolutional layers (3-CNN). All of the mentioned deep learning models start with an embedding layer to generate 100-dimensional word vectors using GloVe; these word vectors are then used as inputs to the convolutional layer or the LSTM layer to produce feature maps. The convolutional layer in CNN has 64 filters of size 2 to match the configuration of DCF, and the number of units in LSTM was also set to 64. The models used the ReLU activation function and the Adam optimizer to reduce the error rate, and the CNN models added a max-pooling layer to avoid overfitting. Finally, the output layer applies the softmax function in (11) to find the final decision, where $z$ is the input and $k = 2$ is the number of classes (i.e., Spam and Not-Spam).

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}, \qquad i = 1, \ldots, k \qquad (11)$$

Table 9 indicates that the proposed model realized the best performance with respect to precision, recall, f1-score and accuracy. In addition, DCF and 1-CNN achieved an equal AUC score, as indicated in Fig. 4.
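Assuming that (11) denotes the softmax function, which the symbols z and k suggest, the final decision of the neural baselines can be sketched as follows. The function name `softmax`, the subtraction of the maximum for numerical stability and the sample scores are illustrative choices, not details taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Turn k raw scores into k probabilities that sum to one."""
    m = max(z)                            # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.3, 2.1]                       # made-up outputs for (Not-Spam, Spam)
probs = softmax(scores)
predicted = "Spam" if probs[1] > probs[0] else "Not-Spam"
print(predicted)  # Spam
```

With k = 2 the class with the larger raw score always receives the larger probability, so the decision reduces to comparing the two outputs.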

Multi-grained cascade forest
The gcForest model proposed in [8] depends on the multi-grained scanning procedure for feature extraction. However, this procedure is not capable of manipulating textual data. So, in order to evaluate the gcForest model on the SMS spam dataset, the messages were represented by GloVe word embeddings before feeding them to the model, considering that the word embedding method conveys semantic relationships, unlike the TF-IDF method [40]. Table 10 indicates that DCF achieved better performance than gcForest. Nevertheless, TF-IDF features led to poor performance on DCF, as the accuracy decreased to 75% compared with word embeddings.

Discussion
This paper introduced a dynamic (self-adaptive) deep ensemble technique to classify Spam and Not-Spam messages with remarkable classification results compared to the methods described in literature. The model suggested in this paper outperformed machine learning algorithms as well as deep learning models since ensemble learning connects the decisions from individual learners to improve the final decision. Moreover, DCF exceeded the outcomes of gcForest [8], since DCF carries the high-level features of textual data and maintains the semantic relationships. The introduced model extracted hidden features from data with the help of convolutional layers and pooling layers, unlike machine learning classifiers that require manual feature extraction from textual data, which requires domain knowledge. Furthermore, deep learning methods have fixed complexity, which means that they perform inefficiently on small-scale data. On the other hand, the proposed model can set the complexity automatically as the number of levels is determined according to the rate of accuracy increase, which means that it can perform efficiently on both small-scale data and large-scale data.
The main gaps in literature, which are addressed by the proposed algorithm, are stated as follows:
- No domain expertise is required to carry out the classification process.
- Dynamic increase in the model complexity in proportion to the increase in performance.
To sum up, the model developed in this paper can separate legitimate text messages from fraudulent ones with high accuracy and low complexity. This filtering process will reduce the possibility of stealing people's sensitive data and will ensure that the users will be able to focus on messages from multiple industry sectors, which will help companies grow their businesses.

Conclusion
This paper presents a dynamic deep ensemble model for categorizing text messages into Spam and Not-Spam. The model starts from passing the word embeddings through convolutional and pooling layers to dispense with manual feature extraction. Then, the model sends the feature maps to the classification layers where the base classifiers: two random forests and two extremely randomized trees, carry out the predictions. Adopting ensemble techniques like boosting and bagging in constructing the model accomplished more accurate outcomes than single classifiers. Ensemble procedure is implemented by processing the input in a level-by-level manner until reaching the last level in which the average of class probabilities is calculated to take the highest average value representing the predicted label. This procedure facilitates the adjusting of the model complexity, unlike deep learning where the model complexity is determined in advance. As confirmed by the experimental findings, the proposed model surpassed the traditional machine learning classifiers as well as the existing deep neural networks in terms of precision, recall and f1-score in addition to achieving the highest accuracy rate of 98.38%. Overall, the suggested solution in this paper can significantly minimize the risks related to security attacks such as SMS phishing by filtering spam messages. Future work may include the detection of spam content written in different languages other than English. Furthermore, a slight change in the model architecture may be considered for classifying messages that involve images.

Conflict of interest
The authors declare that they have no conflict of interest.
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.