New weighted BERT features and multi-CNN models to enhance the performance of MOOC posts classification

Learning is an essential human need, and its means have evolved. Ten years ago, Massive Open Online Courses (MOOCs) were introduced, attracting considerable interest and many learners. MOOCs provide forums in which learners interact with instructors and report problems they encounter in the educational process. However, MOOCs have a high dropout rate, partly because it is difficult to follow up on learners' posts and to identify the urgent ones quickly enough to react. This research aims to assist instructors in automatically identifying urgent posts, making it easier to respond to such posts rapidly, increasing learner engagement, and improving the course completion rate. In this paper, we propose a novel classification model for identifying urgent posts. The proposed model consists of four stages. In the first stage, the post-text is encoded and vectorized using a pre-trained BERT model. In the second stage, a novel feature aggregation model is proposed to reveal data-driven relationships between token features and to represent them as higher-level features. In the third stage, a novel model based on convolutional neural networks (CNNs) is proposed to reveal the meaning of the text context more accurately. In the last stage, the extracted composite features are used to classify the text of the post. Several experimental studies were conducted to obtain the best performance from each stage of the proposed system. The experimental results demonstrated the architectural efficiency of the proposed feature aggregation and multi-CNN models, as well as the accuracy of the proposed system compared with current research.


Introduction
Massive Open Online Courses (MOOCs) date back to 2008, when videos were used as a medium of instruction. When Stanford University launched its free online courses in 2012, nearly 300,000 learners enrolled. Currently, the number of learners through MOOCs is more than 220 million worldwide [1]. Despite their popularity, MOOCs face many challenges, such as completion rate, accountability, accreditation, accessibility, and financial sustainability [2].
One of the most significant challenges facing MOOCs is the low completion rate, which represents a major obstacle to achieving the goal of the educational process itself. One of the main reasons for it is the lack of interaction between the instructor and the learner [3][4][5]. MOOCs provide forums for learners to communicate with one another and instructors [6]. Hence, it provides a means for the learners to express the problems and obstacles encountered during the learning process [7,8]. Despite numerous learners' posts during the course, only about 20% are urgent and require the instructor's attention [7]. As a result, the instructor is overwhelmed with identifying urgent posts that require a quick response. Automatically informing instructors in real-time of urgent posts is one of the most important tools for improving engagement, reducing dropouts, and increasing completion rates [9].
The identification of urgent posts is a text classification problem. Text classification problems are solved using traditional machine learning and deep learning techniques combined with natural language processing techniques. These technologies are employed in many applications, such as Twitter sentiment analysis [10], YouTube comments [11], emotion text classification [12], text plagiarism detection [13,14], and educational data mining [11].
To categorize MOOC forum posts into urgent and non-urgent topics, extensive studies have been conducted based on word representation and classification approaches. In [7,8,15], statistical methods such as term frequency (TF), inverse document frequency (IDF), and term frequency-inverse document frequency (TF-IDF) are used to convert text terms into numerical representations. These methods disregard the word sequence of the text [16]. This leads to a defect in comprehending the document context, as the meaning of the context depends on the order of the words and the relationship between each word and its neighbors. In [17][18][19], deep learning models are introduced for word representation, and each word is represented by a dense vector that reflects its significance within the context of the document [20].
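The loss of word order in count-based representations can be illustrated with a minimal sketch. The two example posts and the whitespace tokenizer below are illustrative assumptions, not material from the cited studies.

```python
from collections import Counter


def term_frequency(text):
    """Return a bag-of-words term-frequency map (whitespace tokenization)."""
    return Counter(text.lower().split())


# Two hypothetical posts with the same words in a different order.
post_a = "the quiz closed before the deadline"
post_b = "the deadline closed before the quiz"

tf_a = term_frequency(post_a)
tf_b = term_frequency(post_b)

# Both posts map to identical counts, so a TF-based model cannot
# distinguish them even though their meanings differ.
same_representation = tf_a == tf_b
print(same_representation)
```

Because IDF only rescales each term's count, a TF-IDF representation of these two posts would be identical as well.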
In [7,8,15], traditional classification algorithms, such as nearest centroid, SVM, and others, are used to classify MOOC forum posts. These algorithms are straightforward and demand little computing power, but their performance largely depends on human intervention in the feature selection process. When these algorithms perform poorly, researchers must redesign the experiments and apply feature selection techniques to find the features that improve performance. Additionally, these techniques can learn from comparatively small data sets. However, the best weighted F1 score obtained by these algorithms is 88%, and the best F1 score for the class of urgent posts is 70%.
On the other hand, deep learning classification methods can improve their results through repetition without human involvement. They are also able to fit a large quantity of data and weigh the contribution of each feature to the decision-making process. In [17][18][19], deep learning algorithms are utilized for word representation and classification process. Pre-trained deep learning models such as Google News and GloVe are used for word embedding to extract more text numerical representation features.
Multiple CNN, GRU, and attention layers are also used to build the classifier. These efforts focused on additional representational features, selecting the effective features, and giving the most significant features more weight. However, the best weighted F1 score obtained by these methods is 91.8%, and the best F1 score for the class of urgent posts is 80.1%. In these methods, the intra-relationships between word features and the inter-relationships among the words of the text were not taken into consideration in revealing the post-context meaning that leads to detecting urgent posts.
In this study, a novel MOOC post-classifier is proposed to assist instructors in quickly accessing and responding to urgent posts. It considers disclosing post-context meaning, which helps classify urgent posts accurately. It is interested in embedding each word into effective numerical representational features, extracting intra-relationships between word features, and inter-relationships between text words that capture the post-context meaning. The proposed model consists of four stages: tokenizing and embedding, aggregation and weighting of token's features, extraction of post-context meaning, and classification.
Generally, the main contributions of this research are summarized as follows:
• A pre-trained BERT model is employed for the tokenization and embedding processes; it can return different vectors for the same word depending on its context.
• A novel feature aggregation model for token's features is proposed to uncover data-driven relationships between them and express them as a higher-level feature. It also considers the aggregated feature's weight with the word meaning's diversity within the posts.
• A novel parallel CNNs model is constructed to extract the composite features of multiple words to reveal the post-context meaning. It consists of four different CNN architectures that can extract the different relationships between the sentence words and uncover the post-context meaning.
• The relevance of the post can be determined by analyzing the extracted composite features through the proposed neural network architecture in the classification stage of the proposed system.
• Experimental comparative studies are conducted to evaluate the design efficiency of the feature aggregation model of token's features and the multi-CNN model to reveal the post-context meaning.
The proposed system performance is evaluated and compared with state-of-the-art algorithms. The experiments were conducted using three groups of training and test datasets from the benchmark database of the Stanford MOOC post corpus, as predefined in [8,18,19]. The experimental results showed that the proposed model, compared with the other algorithms, achieved a significant improvement in the weighted F1 score and a balance between the precision and recall scores of urgent posts.
The remainder of this paper is organized as follows: In Sect. 2, the related work will be explored. In Sect. 3, the proposed approach is explained in detail. In Sect. 4, the experiments and obtained results will be presented and discussed. Finally, in Sect. 5, the conclusion will be presented.

Related work
Due to the recent extensive use of MOOCs, several classification algorithms for MOOC post forums have been proposed. These algorithms can be divided into two categories based on the techniques employed: traditional machine learning and deep learning approaches [18].
Feng et al. [21] analyzed more than 100,000 discussion threads collected from Coursera. They found that most of the posts were unrelated to the course content. The posts were classified based on new features related to user interactions with different subforums. Linear regression combined with a gradient boosting decision tree (GBDT) was used to enhance the classification of a discussion thread. The advantage of this model is that it is based on features that are independent of the course content of the thread. It achieved an accuracy of 85%, which is an improvement of 12% compared to the baseline results [22].
Agrawal et al. [7] proposed a labeled MOOC dataset and a two-stage system. In the first stage, a classifier model is developed to identify confusing posts. In the second stage, a recommendation system is applied to recommend a short clip that clarifies the confusion. The proposed solution achieved an F-score of 77% for the confused label, depending on the course. Cui and Wise [23] used a binary support vector machine model to classify whether question posts were related to course content or not. Bakharia et al. [15] conducted a comparative study to classify the Stanford MOOC posts according to confusion, urgency, and sentiment labels; they compared the performance of three classifiers: Naïve Bayes, support vector machine (RBF), and random forest. Almatrafi et al. [8] used a combination of metadata and linguistic features to build a model for identifying urgent MOOC posts. According to their results, the AdaBoost algorithm achieved the best results.
All the previous methods used traditional machine learning algorithms, and even the best algorithm only raised the F-score of the urgent label to 77%, which is insufficient. These results are largely attributed to the nature of MOOC posts, which are very short in most cases, contain spelling errors, and are very noisy. Therefore, researchers turned to deep learning techniques to handle these problems.
The recent improvements in hardware and computing power allow powerful deep learning algorithms to be applied and developed for large datasets. Classifying text using deep learning depends on embedding techniques to represent text. Generally, the text consists of words and characters, and different embedding techniques can represent both. Then, deep learning methods utilize the word and character embedding vectors to make the decision [18].
Ombabi et al. [24] proposed an opinion analysis algorithm using Twitter to summarize user interests. They used pre-trained Word2Vec as a word embedding technique and a combination of CNN and support vector machine (SVM) for opinion classification. SVM provided the final prediction based on the features and semantic information extracted by CNN. Sotthisopha et al. [25] proposed a short text classification algorithm based on multichannel CNN and the k-max-pooling layer. They also added a data preprocessing module to maximize the coverage of word embedding. Guo et al. [18] proposed a hybrid model using features extracted from word embedding and character embedding. They evaluated the performance of the proposed model with Google-news vectors and GloVe, and reported that the pre-trained word embedding model based on Google-news vectors offers a better result than that based on GloVe. Khodeir [19] used BERT as an embedding technique and Bi-GRU to build the classification model. However, the proposed solution only slightly enhanced the results. Although the state-of-the-art algorithm achieved a weighted F1 value of 91.9%, this result does not accurately reflect the enhancement in the classification of urgent posts. The class "not urgent" achieved an F1 value of 94.8%, whereas the class "urgent" achieved an F1 value of 81.2%. The results indicate that the model still struggles to identify urgent posts.

Proposed system
Identifying urgent posts in MOOCs becomes a significant challenge for instructors as the number of students and their posts grows. Therefore, an accurate model is proposed to classify MOOCs forum posts into urgent and non-urgent topics. It assists in prioritizing replies and managing many posts. The goal of the proposed model is to improve engagement, reduce dropouts, and increase completion rates.
The proposed model is based on deep learning approaches for word representation and classification processes. It consists of four stages: tokenizing and embedding, aggregation and weighting of token's features, extraction of context meaning, and classification, as shown in Fig. 1.
In the first stage, the proposed model takes a post-text as input and tokenizes it. Then, each token is represented by a vector of numerical features. In the second stage, a novel feature aggregation model is proposed to uncover data-driven relationships between token's features and represent them as higher-level features. In the third stage, a novel multi-CNN model is constructed to reveal the context meaning of the post-text, considering the discovery of the post meaning based on the different relationships between the sentence words. In the fourth stage, the extracted features are utilized to classify the post-text.

Tokenizing and embedding stage
The post-text for MOOCs is in an unstructured format. Therefore, the first stage of the proposed system is designed to preprocess the post-text and prepare it for use in the subsequent stages, which converts the post-text into numerical representation, as shown in Fig. 2. This stage employs the BERT model to perform tokenization and embedding processes on the input post, as shown in Fig. 3. The contextual token's embedding value depends on the token's position, segment's embedding, and token's embedding [26]. The main advantage of using the BERT model is that it can interpret the context of a word; it returns different vectors for the same word depending on the words around it.
For example, the following text was tokenized and embedded by the BERT model. The BERT model first tokenized the text, and each token is represented by a predefined value depending on the words around it, as shown in Fig. 4.
The word "bank" appeared in three positions with two different meanings. The BERT model represented each occurrence with a vector that reflects its meaning in context, as shown in Table 1. The similarity between the word "bank" in the first two positions, "bank vault" and "bank robber," equals 0.94, while the similarity in the last two positions, "bank robber" and "river bank," equals 0.69. The results demonstrate the advantage of using the BERT model to generate an embedding vector. This capability of the BERT model is utilized to improve token's representation and model performance.
Fig. 1 Overall structure of the proposed system
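The comparison above can be sketched with a small similarity function. The text does not name the similarity measure, so cosine similarity is an assumption here, and the three-dimensional vectors are made-up stand-ins for the 768-dimensional BERT vectors of Table 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings of "bank" in three contexts (not real BERT output).
bank_vault = [0.9, 0.1, 0.2]
bank_robber = [0.8, 0.2, 0.3]
river_bank = [0.1, 0.9, 0.4]

sim_financial = cosine_similarity(bank_vault, bank_robber)
sim_mixed = cosine_similarity(bank_robber, river_bank)

# Occurrences with the same sense should be closer than occurrences
# with different senses.
print(sim_financial > sim_mixed)
```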
The proposed system is based on the pre-trained BERT-Base-Uncased model, which is case-insensitive. The pre-trained BERT model transforms each token into f features. It has a fixed input length of n, where n is the number of input tokens in the text. A text with more than n tokens is truncated at token n, and a text with fewer than n tokens is padded. The input post is thus transformed into the $P_{tok}$ array, as expressed in Eq. (1).
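The truncation and padding rule can be sketched as follows. This is a simplified illustration rather than the tokenizer's actual behavior: the function name `fit_to_length` and the padding id of 0 are assumptions, and special tokens are not handled.

```python
def fit_to_length(token_ids, n, pad_id=0):
    """Truncate or pad a list of token ids to exactly n entries.

    pad_id is an assumed placeholder for the padding token id.
    """
    if len(token_ids) >= n:
        return token_ids[:n]
    return token_ids + [pad_id] * (n - len(token_ids))


short_post = [101, 2054, 102]                 # hypothetical token ids
long_post = [101, 7, 8, 9, 10, 11, 102]

print(fit_to_length(short_post, 5))
print(fit_to_length(long_post, 5))
```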

Aggregation and weighting of token's features stage
Feature aggregation is a technique that builds a global feature vector by integrating the various local features of a dataset instance to form the global features of data. In the pattern recognition field, the feature aggregation process is a method that takes many local features from an image and combines them into a single global feature vector. The purpose of aggregating features is to uncover data-driven relationships between instance features that may be difficult to detect. Each aggregated feature can be considered as a higher-level feature that epitomizes multiple lower-level features. Therefore, these higher-level features can reveal significantly more useful information than any single local feature.
In this study, a novel feature aggregation model is proposed. The goal of this model is to uncover data-driven relationships between token's features and represent them as higher-level features. Furthermore, the proposed model considers the weighting of the aggregated feature ''token's global feature'' in different situations, especially when using the same token in different posts. This makes the model more flexible with the diversity of the word meaning within the posts, as each word may carry different meanings and weights in determining the importance of the post.
The proposed model aggregates and weights each token's features using a deep learning approach. It consists of one convolutional layer with E filters, each of size $1 \times F_s$. Each filter combines the f local features of a token and expresses them as a global feature. The proposed CNN model slides each filter over the input tokens: the token's feature values are multiplied by the corresponding filter values, and the products are summed into a single global value in the output channel (feature map), which uncovers data-driven relationships between the token's features. Because the layer has E filters, one per output channel, the global feature of each token is weighted into E values that reflect the diversity of the word's meaning and the importance of the word within the posts.
In this stage, as shown in Fig. 5, the proposed aggregation and weighting model takes the $P_{tok}$ array as input and produces the $P_{wg}$ array, as expressed in Eq. (2). Equation (4) illustrates how the number of output vectors $D_n$ is computed, where f is the number of token's features, $F_s$ is the filter width, which equals the number of token's features, p is the padding value, s is the stride value, $F_h$ is the filter height, which equals one, and E is the number of channels. Each feature value $F_j$ of $Tw_i$ in the $P_{wg}$ array is computed as indicated in Eq. (5), where g is the rectified linear unit (ReLU) function and $b_{ei}$ is the bias.
where $f_e = W_e \ast T_i + b_{ei}$
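A minimal sketch of this stage is given below, under two assumptions that the extracted text does not confirm: that each output feature is the ReLU of the filter response $f_e$, and that the output size follows the standard convolution formula (size − filter + 2p)/s + 1. The function names and toy values are illustrative only.

```python
def relu(x):
    """Rectified linear unit."""
    return x if x > 0.0 else 0.0


def conv_output_size(size, filter_size, padding=0, stride=1):
    """Standard convolution output size (assumed form of the size equation)."""
    return (size - filter_size + 2 * padding) // stride + 1


def aggregate_token_features(p_tok, filters, biases):
    """Map an n x f token array to an n x E array.

    Each of the E filters spans all f features of one token, so every
    token yields one aggregated value per filter.
    """
    p_wg = []
    for token in p_tok:
        row = []
        for weights, bias in zip(filters, biases):
            f_e = sum(w * t for w, t in zip(weights, token)) + bias
            row.append(relu(f_e))
        p_wg.append(row)
    return p_wg


# Toy input: n = 2 tokens, f = 3 features, E = 2 filters.
p_tok = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
filters = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
biases = [0.0, -1.0]

p_wg = aggregate_token_features(p_tok, filters, biases)
print(p_wg)
```

With a filter width equal to the number of token features (for example 768), `conv_output_size(768, 768)` gives 1, which is why each token collapses to a single value per filter.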

Extraction of context meaning stage
A text consists of related words. Its context is always based on a set of closely related words that sheds light not only on the meanings of single words but also on the meaning and purpose of the entire text. Therefore, the context of the text is the essence of the intended meaning in any textual or verbal structure, and it emerges from the relationships of the words to each other. The meaning of a sentence is revealed only through the contextualization of the linguistic unit, that is, by placing it in different contexts. Therefore, in this study, a new model is proposed that aims to reveal the context of the post and recognize its importance; it considers the discovery of the meaning based on the different relationships between sentence words.
The purpose of this stage is to extract the composite features of multiple words to reveal the context of the post. It is based on a new multi-CNN model. This model is constructed with different convolutional filter sizes to extract the different relationships between the words in a text, which can reveal the post-context meaning. The proposed multi-CNN model consists of four parallel CNNs, as shown in Fig. 6. Each of them has two layers: a convolutional layer and a pooling layer. The first CNN, with a filter size of $1 \times F_{DE}$, is used to extract the best features of a single word that can be expressed within the context meaning of the post. The second CNN, with a filter size of $2 \times F_{DE}$, is used to find the features of each pair of adjacent words that can be highlighted in the context meaning of the post. The third CNN, with a filter size of $3 \times F_{DE}$, is developed to extract the expressive features of each group of three words in the context meaning of the post. The fourth CNN, with a filter size of $4 \times F_{DE}$, is structured to discover the best features of each group of four words that are likely to reveal the context meaning of the post. In this stage, the proposed model slides the convolutional layer filters of the CNNs over the input tokens. In each convolutional layer filter, the token's feature values are multiplied by their corresponding filter values. Then, the outputs of the convolutional layers are fed to the corresponding pooling layers to uncover the feature values of different groups of words, which can emphasize the context meaning of the post. The proposed multi-CNN model takes the $P_{wg}$ array of the previous stage as input. Then, for each CNN, a $P_{Cm}$ array is constructed as expressed in Eq. (6), where m is the index of the CNN. Equation (7) illustrates how the number of output vectors $C_m$ is determined, and Eq. (8) illustrates how the number of output features $F_C$ is computed, where $f_{cl}$ is the filter width, which equals the number of token's features DE, $F_h$ is the filter height, p is the padding value, which equals zero, and s is the stride value, which equals one. After constructing the output features of each CNN, as shown in Fig. 6, the proposed model concatenates the $P_{Cm}$ arrays into a single $P_C$ array, as indicated in Eq. (9).
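The parallel structure described above can be sketched in plain Python. Several details are assumptions rather than statements from the text: the pooling layer is taken to be global max pooling, the convolution is followed by ReLU, and each CNN is given a single filter for brevity. All names and toy values are hypothetical.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def conv_over_tokens(p_wg, kernel, bias=0.0):
    """Slide an h x E kernel over an n x E array (stride 1, no padding).

    Returns n - h + 1 activations, one per window of h adjacent tokens.
    """
    h = len(kernel)
    outputs = []
    for start in range(len(p_wg) - h + 1):
        total = bias
        for k in range(h):
            total += sum(w * x for w, x in zip(kernel[k], p_wg[start + k]))
        outputs.append(relu(total))
    return outputs


def multi_cnn_features(p_wg, kernels_by_height):
    """Apply each kernel, max-pool its feature map, and concatenate.

    Assumes the input has at least as many tokens as the tallest kernel.
    """
    p_c = []
    for h in sorted(kernels_by_height):
        for kernel in kernels_by_height[h]:
            p_c.append(max(conv_over_tokens(p_wg, kernel)))
    return p_c


# Toy input: n = 4 tokens, E = 2 aggregated features per token.
p_wg = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
kernels_by_height = {
    1: [[[1.0, 1.0]]],
    2: [[[1.0, 0.0], [0.0, 1.0]]],
    3: [[[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]],
    4: [[[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]],
}

p_c = multi_cnn_features(p_wg, kernels_by_height)
print(p_c)
```

A kernel of height h sees h adjacent tokens at a time, so the four heights correspond to single words, word pairs, and groups of three and four words.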

Classification stage
As previously indicated, in the third stage, the different relationships between the words in the text are extracted, which can reveal the post-context meaning. These relationships are expressed as composite features, which are computed and concatenated into the $P_C$ array. The feature values of the $P_C$ array are fed to the fully connected layer in the last stage of the proposed system, as shown in Fig. 7. The fully connected layer consists of ReLU units, and the number of units equals the number of features in the $P_C$ array. Then, the output of the ReLU units is fed to an output unit with a sigmoid function. Equation (10) illustrates how the output of the proposed system, which detects the importance of the post, is determined, where g is the sigmoid activation function and $b_o$ is the bias.
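The classification stage can be sketched as a ReLU layer followed by a single sigmoid unit. The weights below are arbitrary toy values, and the decision threshold of 0.5 is an assumption that the text does not state.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classify(p_c, hidden_weights, hidden_biases, out_weights, out_bias,
             threshold=0.5):
    """Return (score, is_urgent) for one post.

    hidden_weights has one weight row per ReLU unit; threshold is an
    assumed cut-off on the sigmoid output.
    """
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, p_c)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    score = sigmoid(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)
    return score, score >= threshold


# Toy example: two composite features and two ReLU units.
p_c = [2.0, 0.0]
hidden_weights = [[1.0, 0.0], [0.0, 1.0]]
hidden_biases = [0.0, 0.0]
out_weights = [1.0, -1.0]
out_bias = -1.0

score, is_urgent = classify(p_c, hidden_weights, hidden_biases,
                            out_weights, out_bias)
print(round(score, 3), is_urgent)
```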

Experimental results and discussion
Several experiments were conducted to test and evaluate the proposed system and its stages. The first set of experiments evaluated the structural efficiency of the proposed system. The performance of the overall structure of the proposed system was also evaluated and compared to state-of-the-art algorithms in the second set of experiments.

Dataset and evaluation metrics
The experiments were conducted using the benchmark Stanford MOOC post corpus proposed by Agrawal et al. [7]. It contains posts from 11 public online classes at Stanford University. The courses are categorized into three domains: Humanities/Sciences, Medicine, and Education. Each domain contains nearly 10,000 posts, and the total number of posts is 30,002. Agrawal et al. [7] manually classified the data into six dimensions, namely question, opinion, sentiment, urgency, and confusion, on a scale of 1 to 7. Table 2 shows an example of the dataset. Agrawal et al. [7] excluded about 398 posts with malformed or missing scores, reducing the total number of posts to 29,604. For each post, the metadata includes up-votes, number of reads, post position, etc. [7]. This study aims to classify the posts into "urgent posts" and "not urgent posts." Therefore, the class labels of the corpus were converted to binary labels: the urgency score is mapped to 0 if it is lower than 4 and to 1 if it is greater than or equal to 4 [7,8,13,18,19]. Figure 8 shows the number of posts per label in the original labeling, and Fig. 9 shows the number of posts after this approximation.
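The label conversion described above amounts to a single threshold test. The sketch below uses made-up urgency scores; the function name is hypothetical.

```python
def binarize_urgency(score, threshold=4.0):
    """Map a 1-7 urgency score to 1 (urgent) or 0 (not urgent)."""
    return 1 if score >= threshold else 0


# Hypothetical urgency scores for four posts.
scores = [1.0, 3.5, 4.0, 6.5]
labels = [binarize_urgency(s) for s in scores]
print(labels)
```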
The state-of-the-art algorithms [8,18,19] used on the Stanford MOOC corpus dataset divided the posts into three different scenarios: Groups A, B, and C.
• Group A: this group simulates the general case in which the training and test datasets were independent of the course or domain.
• Group B: the data were split into training and test datasets depending on the course name; all posts related to some courses were selected as the training dataset, whereas the posts related to the other courses were selected as the test dataset.
• Group C (a Domain out): the data were split into training and test datasets depending on the domain; Medicine and Education domains for the training and evaluation processes and Humanities domain for the test process.
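The course-based and domain-based scenarios can be sketched as simple filters over post records. The record fields and course names below are hypothetical, and the actual assignment of courses to training and test sets follows the predefined groups in [8,18,19], which this sketch does not reproduce.

```python
def split_by_field(posts, field, test_values):
    """Put posts whose `field` value is in test_values into the test set."""
    train = [p for p in posts if p[field] not in test_values]
    test = [p for p in posts if p[field] in test_values]
    return train, test


# Hypothetical post records; course names are placeholders.
posts = [
    {"course": "course_1", "domain": "Medicine", "urgent": 1},
    {"course": "course_2", "domain": "Education", "urgent": 0},
    {"course": "course_3", "domain": "Humanities", "urgent": 0},
    {"course": "course_3", "domain": "Humanities", "urgent": 1},
]

# Group B style: hold out whole courses.
train_b, test_b = split_by_field(posts, "course", {"course_2"})

# Group C style: hold out the Humanities domain.
train_c, test_c = split_by_field(posts, "domain", {"Humanities"})

print(len(train_b), len(test_b), len(train_c), len(test_c))
```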
The proposed system performance was evaluated and compared with state-of-the-art algorithms based on recall, precision, accuracy, and F-score metrics. The recall is the fraction of the actual posts of a class that are correctly classified, as indicated in Eq. (11). The precision is the fraction of the posts assigned to a class that truly belong to it, as shown in Eq. (12). The accuracy of the proposed system is the fraction of all posts that are correctly classified, as shown in Eq. (13). The F1-score is the harmonic mean of precision and recall, as shown in Eq. (14). It is a suitable measure for unbalanced data.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{11}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{12}$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{13}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{14}$$
where TN, FN, FP, and TP are the numbers of true negatives, false negatives, false positives, and true positives, respectively.
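These metrics can be computed from the four counts as in the sketch below. The weighted F1 score is taken here to be the support-weighted average of the per-class F1 scores, which is the common definition but an assumption about how the reported scores were computed; the toy labels are made up.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for the chosen positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def weighted_f1(y_true, y_pred, labels=(0, 1)):
    """Average of per-class F1 scores, weighted by class support."""
    score = 0.0
    for label in labels:
        tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive=label)
        f1 = precision_recall_f1(tp, fp, fn)[2]
        score += f1 * (tp + fn) / len(y_true)
    return score


# Toy labels: 3 urgent posts out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision, recall, f1_urgent = precision_recall_f1(tp, fp, fn)
print(tp, fp, tn, fn, accuracy(tp, fp, tn, fn))
```

In this toy case the accuracy is 0.8 while the F1 score of the urgent class is only about 0.67, which illustrates why the per-class score matters on unbalanced data.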

Structural efficiency evaluation of proposed system
The proposed system consists of four stages. In the first stage, the post is preprocessed so that it can be fed into the subsequent stages: the post-text is transformed into a numerical representation, the $P_{tok}$ array. This stage uses a pre-trained BERT model to perform the tokenization and embedding processes on the input post, with a fixed input length of n = 512 tokens.
In the second stage, the newly proposed feature aggregation model was applied to the token's features (the $P_{tok}$ array), as shown in Fig. 1. It aimed to uncover data-driven relationships between the token's features and express them as higher-level features. It consists of one convolutional layer with E filters, each of size $1 \times 768$. Each filter combines the 768 local features of a token and expresses them as a global feature.
Several experiments were conducted to determine the best values of the number of filters E. The best value of E is 350, as shown in Table 3, which has the highest accuracy of the post-classification compared with the other values of E. The experiments indicate that 350 is the best value of the number of filters representing the number of different weighting values of the aggregated feature ''token's global feature'', which made the model more flexible with the diversity of a word meaning within the posts.
To evaluate the effect of the aggregation and weighting of the token's features stage on the proposed system performance, six experiments were conducted. The resulting areas under the receiver operating characteristic curve (AUC) are shown in Figs. 10 and 11. These figures show that the feature aggregation stage has a positive effect, and the proposed system based on this stage achieved the highest AUC values on the datasets of groups A, B, and C.
In the third stage, the post-context meaning was revealed. It is based on a novel proposed multi-CNN model that extracts the composite features of multiple words. It consists of four CNNs. Each of them contains two layers: a convolutional layer and a pooling layer. The first CNN, with a filter size of $1 \times 350$, is used to extract the best features of a single word that can be expressed within the context meaning of the post. The second CNN, with a filter size of $2 \times 350$, is used to find the features of each pair of words that can be highlighted in the post-context meaning. The third CNN, with a filter size of $3 \times 350$, is used to extract the expressive features of each group of three words in the context meaning of the post. The fourth CNN, with a filter size of $4 \times 350$, is used to discover the best features of each group of four words that could reveal the post-context meaning.
Several experiments were conducted to assess the structural efficiency of the proposed multi-CNN model in revealing the post-context meaning. The proposed multi-CNN model, which extracts the composite features of single words and of groups of two, three, and four words of the post-text, is the most efficient structure, as shown in Table 4. It can identify the most effective composite features, resulting in the best post-classification performance.

Comparison with state-of-the-art algorithms
The performance of the overall structure of the proposed system was evaluated and compared to state-of-the-art algorithms using the group A, B, and C datasets of the Stanford MOOC post corpus. These algorithms focused on additional data preprocessing and representational features, selecting the effective features, and extracting the long-term dependencies between the post words. Almatrafi et al. [8] proposed a MOOC post classification model. It is based on TF, linguistic and metadata features, and the AdaBoost classification algorithm. It used the TF technique for word representation to convert text terms into numerical representations. TF ignores the word order of the text. Because the meaning of the context relies on the order of the words and the relationship between each word and its neighbors, this leads to a defect in understanding the document context. In addition, linguistic features are extracted using Linguistic Inquiry and Word Count (LIWC), a text analysis tool that depends on word count. The LIWC performance is negatively impacted by misspellings, symbols, and expressions, which limits the AdaBoost performance.
Guo et al. [7] developed a model based on google-news, metadata features, and architecture of CNN and Bi-GRU layers. It used google-news for word embedding. Google-  news created a vector that represents the word's absolute meaning, while ignoring the word's context meaning. The proposed CNN and Bi-GRU layers architecture emphasized the extraction of long-term dependencies between post words. Nabila [8] proposed model based on preprocessing, BERT, and Bi-GRU techniques. Text data preprocessing was used to eliminate stop words and special marks such as ''!''. These stop words and punctuation marks may be helping to better understand the context of the posts, which negatively affects the classification accuracy. Bi-GRU was used to build the classification model depending on the extraction of long-term dependencies between words. Despite these efforts of the state-of-the-art algorithms, the best obtained F1-weighted score of these methods is 91.9% and 81.2% of the class of urgent posts, as shown in Tables 5, 6, and 7. On the other hand, the proposed model takes into account uncovering data-driven relationships between the word features and weighting the extracted dependencies with the diversity of the meaning of a word within the posts. It also considers extracting the various relationships between the sentence words to disclose the post-context meaning. The proposed model achieved 83.6%, 83%, 83.3%, and 92.7% in precision, recall, F1 of urgent, and F1-weighted, respectively, as shown in Table 5. It obtained an improvement rate equivalent to 2.0%, 1.5%, 2.1%, and 0.8% over the best state-of-the-art algorithm results on group A dataset. It also obtained an enhancement of the F1-urgent scores by 0.8% on group B dataset compared with [7] and by 0.7% compared with the state-of-theart algorithm result [8], as shown in Table 6. 
In addition, the proposed model improved the overall F1-weighted score by 0.3% on the group C dataset compared to the state-of-the-art algorithm result [8], while nearly maintaining the same performance in urgent post detection.
Precision and recall scores reflect the trade-off between quality and variation, and they depend on the intended application. In this study, the numerical analysis of the experimental dataset indicates that urgent posts account for 20% of all posts, yet they should be the focus of the instructor's attention. Lower precision with higher recall means that the instructor must manually filter many posts. In contrast, higher precision with lower recall means that the system fails to identify a large portion of the urgent posts. A balance between precision and recall must therefore be maintained, because both matter in this application. The experimental results of the proposed model on the different test scenarios demonstrate its efficiency compared to state-of-the-art algorithms: it achieved a clear improvement in the weighted F1 score and a balance between the precision and recall scores of the urgent posts, as shown in Tables 5, 6, and 7.
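To make the reported metrics concrete, the following minimal Python sketch shows how precision, recall, and F1 for the urgent class, and the support-weighted F1 over both classes, can be computed from predicted labels. The function names and the toy labels are illustrative assumptions and are not taken from the original experiments; only the final check uses the precision and recall reported for group A.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from(precision, recall)
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(
        precision_recall_f1(y_true, y_pred, label)[2] * count / n
        for label, count in support.items()
    )


# Hypothetical toy labels: 1 = urgent, 0 = non-urgent (2 of 10 posts urgent).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

urgent_p, urgent_r, urgent_f1 = precision_recall_f1(y_true, y_pred, positive=1)
overall = weighted_f1(y_true, y_pred)

# Consistency check on the reported group A scores (precision 83.6%, recall 83%).
reported_f1 = f1_from(0.836, 0.83)
```

In the toy example, the non-urgent class dominates the weighted F1 because it holds 80% of the support, which is why the F1 of the urgent class is reported separately. The last line confirms that the reported precision and recall for the urgent class are consistent with the reported F1 of 83.3%.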

Conclusion
In this paper, an accurate MOOC post classifier is proposed to increase the interactivity between instructors and learners. It consists of four stages. In the first stage, the post text is tokenized and vectorized using a pre-trained BERT model. In the second stage, a novel feature aggregation model is proposed to uncover data-driven relationships between the token features and express them as higher-level features. In the third stage, a novel multi-CNN model is constructed to reveal the contextual meaning of the post text. In the last stage, the extracted features are used to classify the post text. The performance of the proposed system was evaluated and compared with state-of-the-art algorithms using a benchmark dataset, the Stanford MOOC posts corpus. The experimental results showed the efficiency of the proposed model compared with the other algorithms: it achieved a clear improvement in the weighted F1 score and a balance between the precision and recall scores of the urgent posts.
Extending the research to improve the model's framework could be one aspect of future work. Sequential deep learning networks, such as Bi-GRU and Bi-LSTM, can be used to extract the semantic features that reflect the long-term dependencies between words. Due to the data imbalance (urgent posts make up about 20% of all posts), there is a large discrepancy between the F1 values for urgent and non-urgent posts. Another direction for future research is therefore to address the imbalanced data problem using appropriate techniques such as oversampling and data augmentation. Future research could also investigate the reasons behind urgent posts and whether they relate to content, logistical, or technical issues; we suggest using topic modeling techniques to determine the origin of trending posts and extract relevant topics. Analyzing learners' behavior and opinions to explore the hidden factors that cause dropout can also be addressed.
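As an illustration of the oversampling direction mentioned above, the following minimal sketch duplicates randomly chosen posts of the minority class until every class has the same number of examples. It is only one simple variant (random oversampling with replacement), not a technique evaluated in this work; the function name, the seed, and the toy posts are assumptions introduced for the example.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until all classes match
    the size of the largest class. Returns new (texts, labels) lists."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Sample with replacement only for the missing examples.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical toy posts: 2 urgent (label 1) and 8 non-urgent (label 0).
texts = [f"post {i}" for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

balanced_texts, balanced_labels = random_oversample(texts, labels)
class_counts = Counter(balanced_labels)
```

Oversampling of this kind would normally be applied to the training split only, so that the validation and test sets keep the original class distribution.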
Funding Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Data availability The datasets analyzed during the current study are available in the Stanford repository, http://datastage.stanford.edu/StanfordMoocPosts/.

Conflict of interest
The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.