Instructor-assisted question classification system using machine learning algorithms with N-gram and weighting schemes

One aspect of natural language processing, text classification, has become necessary in the educational domain due to the increasing number of students and the COVID-19 outbreak. The advent of the devastating pandemic and the need to remain safe have intensified discussions around online learning and integrated modules in teaching and learning. In this study, we employed machine learning to develop an automatic instructor-assisted question classification module for learning management systems. In selecting the best classifier, the conventional and the ensemble machine learning algorithms were compared using the tenfold and the fivefold cross-validation techniques. In addition, the N-gram feature selection mechanism and three weighting schemes were evaluated for performance enhancement. The detailed analysis indicates that the ensemble algorithms outperform the conventional ones, and that accuracy decreases as the N-gram size increases. Of all the compared algorithms, the AdaBoost (SVM) ensemble algorithm has the highest accuracy of 78.55% for Unigram (TP, TF, TF-IDF). In addition, AdaBoost (SVM) emerged with the highest F1-score of 0.782, while the ensemble Bagging (RF) algorithm had the highest ROC value of 0.955 for Unigram (TP).


Introduction
The COVID-19 pandemic remains one of the most significant tragic outbreaks and has necessitated numerous educational reforms globally [32]. The sudden closure of tertiary institutions in April 2020 affected schools in 185 countries, with 89.4% of total learners involved [40]. As countries battled to contain the spread of the virus and save lives, in-person teaching and learning were suspended indefinitely. The pandemic revealed flaws in the current educational system and created an opportunity for proactive policies in the eLearning space [32]. Even as institutions migrated hastily to online learning and distance education during the pandemic, the migration came with infrastructure, assessment, technological, pedagogical and financial challenges [40]. As the pandemic effect decreases in 2022 and learner enrollment increases, the eLearning adoption issues in higher education institutions remain precarious [19]. In June 2022, the UNESCO Institute of Statistics data indicated that the total number of tertiary students worldwide had doubled in the last two decades. There was a staggering 240% increase in tertiary enrollment in South and West Asia, East Asia and the Pacific from 2000 to 2020, while Central and Eastern Europe learner enrollment shot up by 84% over the same period. Sub-Saharan African countries had, on average, a 9.4% enrollment increase, with Mauritius alone having the highest gross tertiary enrollment of 40% by 2020.

Problem definition
The global economy, including education, has still not recovered from the devastating COVID-19 pandemic [9,27]. Even as countries continue to roll out novel policies to minimise the spread of the virus and rejuvenate the economy, student enrollment across tertiary institutions is still on the rise [41]. Most institutions globally have migrated to online and distance learning paradigms to limit physical contact and curtail the spread of the virus [28,34]. Traditional education has excellent advantages, including face-to-face meetings with instructors, instant feedback from the instructor when questions are asked, learner monitoring, sociability and solidarity, effective practical sessions, and comprehensive support services [17,36]. Even as the trade-off between online and traditional education continues to improve via blended learning [11], the instructors' ability to address all learner questions and emotions during the limited class sessions remains an issue. Even worse is the repetition of questions, or of questions similar to those the instructor has already addressed. In constructivism and personalised learning, learners are expected to be active in the learning process and deduce their own knowledge [2,14]. As learners construct new knowledge, continuous interaction with the instructor is relevant in shaping the contents learners are exposed to in the active process.

Purpose of study
To address the need for instructors to remain responsive to the growing number of students across tertiary institutions, the study proposes an automatic instructor-assisted question classification system using machine learning algorithms. In addition, we tested the relevance of N-gram and weighting schemes in building the classification model. In line with the purpose of the study, we pose the following research questions:
RQ1. Which machine learning algorithm has the highest accuracy in developing an instructor-assisted question classification system?
RQ2. What other significant machine learning metrics contributed to selecting the best classifier for future prediction?
The rest of the paper is organised as follows: a brief introduction of natural language processing and question classification, a review of related literature, research methods, results and analysis, discussion and findings, conclusion, and future work.

Natural language processing and question classification
Natural Language Processing (NLP) is the basis for information retrieval in text and covers various application areas in computer science and artificial intelligence. With computational techniques, NLP produces human language content and abstraction by relying on mostly unstructured data [43]. Instead of encoding words as indices in dictionaries, NLP currently represents words as continuous vectors with much lower dimensions that capture semantic similarities [38]. In online learning, students have an impersonal, tension-free opportunity to ask diverse questions via text. Using NLP, this text can be segmented so that the instructor can develop an automatic response system or address questions based on importance. Through segmentation, the instructor obtains a labelled data set for context understanding and predictive modelling using machine learning or rule-based systems. A well-built model with high accuracy becomes a crucial basis for the instructor to review the course manual, recommended books, methods of teaching, and the curriculum for optimal learning.

Review of literature
Text categorisation and identification require question classification using machine learning algorithms [51] or rule-based systems [20]. Ehrentraut et al. [8] stated three stages as the basics of a question-answering system. In the first phase, known as question processing, students pose questions that act as input for categorisation and representation. The next step entails selecting documents and paragraphs based on text concentration. The final stage involves matching answers based on the questions' similarity and returning appropriate responses to the learner. Rule-based and machine learning methods are the two most common ways to construct a question classification system. The rule-based approach infers from a knowledge base to detect text relations and provide answers based on a model. In contrast, the machine learning method trains a model on a labelled dataset to extract features and predict the categories of future input questions. This review is limited to text and question classification using deep learning, ensemble and conventional machine learning techniques.
Zulqarnain et al. [51] analysed Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Networks (CNN), CNN-GRU and CNN-LSTM on the Turkish questions dataset. After applying the tenfold cross-validation technique with skip-gram, Word2vec and CBOW, CNN achieved the highest accuracy of 93.7%, using 300 features and skip-gram. Zhou et al. [50] created a hybrid neural network for a question classification system by combining CNN and BiLSTM. They implemented the fivefold cross-validation technique on the Baidu Knows and TREC benchmark databases and compared results with SVM, LSTM and MaxEnt algorithms. Results show that the CNN-BiLSTM outperforms the other algorithms for Precision, Recall and F1 score metrics. Gaire et al. [10] compared their proposed RNN-LSTM with the Multinomial Naïve Bayes (MNB), K-Nearest Neighbor (KNN) and Logistic Regression (LR) traditional machine learning algorithms. Using the Quora database, the study aimed to distinguish between sincere and insincere questions. The RNN-LSTM deep learning algorithm achieved a dominant Precision value of 0.654, Recall value of 0.733, and F1 score of 0.691. Zhen and Sun [49] presented a bagging-based CNN model using Word2vec to map word features to proportional dimensions. They compared the model to Bayes and SVM conventional classification algorithms using bag-of-words, mutual information and TF-IDF methods. The results show that the W2V + B-CNN algorithm has a higher classification accuracy of 88.57% on the Chinese dataset from Harbin Institute of Technology's Information Retrieval Laboratory. Lei et al. [18] proposed an RR-CNN-based architecture for question classification by increasing the generalisation of sentence features. They used five datasets to compare their results to conventional SVM, NB, MNB, and other CNN-based approaches. The RR-CNN achieves the highest accuracy of 86.3% on average compared to the tested algorithms. Upadhya et al. [42] applied an LSTM model to the Amazon community question dataset to identify the questions as yes/no or opinion-related. They discovered a 29%, 31%, and 23% difference in accuracy between their LSTM model and CNN for the dataset components electronics, beauty, and appliances, respectively. Razzaghnoori et al. [33] utilised three feature extraction methods, including clustering, feedforward neural network (FNN) and RNN, to design a question classification system. After comparing the LSTM to other neural networks and SVM utilising the aforementioned feature extraction techniques, an improved accuracy of 81.77% was realised.
Madabushi and Lee [20] compared the NB and SVM conventional algorithms for question classification in a network-based online course. They identified 1120 keywords in building a predictive model. SVM performed better with higher values for Precision and Recall. Mohasseb et al. [23], in a similar study, implemented SVM, RF, DT and NB algorithms using a grammar-based approach instead of the bag-of-words or dictionary standard techniques. The dataset comprised 1,160 questions from a pool of Wikipedia data, TREC 2007 question answer data and Yahoo non-factored questions. The DT classifier emerged as the best algorithm with an accuracy of 90.1% and superior Precision, Recall and F-score values. Hassan et al. [12] proposed a comparative model using Gaussian Naïve Bayes (GNB), SVM, MNB, LR, RF and KNN. They used accuracy, precision, recall, and F1-score as performance metrics to determine the best model. In a step they termed feature engineering, they utilised bag-of-words and TF-IDF before building the final classifier. Results for classification show that the LR had an accuracy of 85.8% on the IMDB dataset, while the KNN performed best with an accuracy of 98.5% on the Spam dataset. Kamath et al. [15] presented a text classification model and compared the performance of the conventional NB, LR, SVM, MLP, and RF to the CNN deep learning algorithm. They utilised the tobacco-3482 dataset and a health dataset from a public institution. Even though the LR algorithm had the highest accuracy of 77% among the conventional algorithms, the CNN performed with a superior accuracy of 83%. Yadav et al. [47] trained LR, SVM, NB and RF for text categorisation and compared these conventional machine learning algorithms to ANN. The accuracy, precision, recall and F1-score analytical metrics formed the basis for comparing the algorithms after pre-processing. Results show a superior classification accuracy for ANN.
Onan [25] compared conventional, ensemble and deep learning machine learning algorithms for sentiment analysis in massive open online courses (MOOCs). Simulation results show a dominant performance of deep learning methods over ensemble techniques. The ensemble methods also performed better than the conventional machine learning algorithms.

Research methodology
The research methodology covers five components: the proposed flow diagram, the dataset, the feature selection mechanisms, the metrics for measuring performance and the algorithms implemented. Figure 1 depicts the flow diagram, which begins with the manual categorisation of textual questions from students. In this phase, we compile questions on the various components of the course. Tokenisers are used before data pre-processing to break down the unstructured, uncategorised questions into chunks of data. During data pre-processing, affixes are removed from the data through stemming. In addition, common words called stop words and punctuation marks are removed from the text. The feature selection module consists of N-gram-sized phrases, with sizes 1, 2, and 3 representing unigram, bigram and trigram, respectively. Term Presence (TP), Term Frequency (TF), and Term Frequency-Inverse Document Frequency (TF-IDF) are then used to calculate the weight of terms in the text. The text and respective class categories are trained using conventional ML algorithms and ensemble ML techniques. The optimal classifier is then determined using classification metrics from the implemented fivefold and tenfold cross-validation and feature selection mechanisms.
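The pre-processing steps described above (tokenisation, stop-word and punctuation removal, stemming) can be sketched in a few lines of Python. This is only a minimal illustration and not the Weka configuration used in the study: the stop-word list and the suffix-stripping rule are hypothetical stand-ins for a full stop-word list and a proper stemmer, and all function names are illustrative.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "for", "we", "do", "how", "what"}

# Naive suffix list standing in for a real stemmer such as Porter's (assumption).
SUFFIXES = ("ing", "ed", "s")


def tokenise(question):
    """Lower-case the text and keep alphanumeric runs, which also drops punctuation."""
    return re.findall(r"[a-z0-9]+", question.lower())


def stem(token):
    """Strip the first matching suffix, provided at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(question):
    """Tokenise, remove stop words, then stem the remaining tokens."""
    return [stem(t) for t in tokenise(question) if t not in STOP_WORDS]
```

For instance, `preprocess("How do we create tables in the database?")` reduces the question to the three content tokens `create`, `table` and `database`.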

The dataset
The questions for the dataset were collected continuously during the Database Management course at the University of Education, Winneba, Ghana, in 2022. The dataset contains 1096 textual questions labelled under seven categories. As shown in the accompanying figure, theory-based questions had the highest share of labels at 33%. Practical questions accounted for 16%, followed by 15% for the course manual, 11% for AOQ, 10% for Assessment, and 9% for Attendance, with TLM having the smallest share at 6%. As depicted in Table 1, questions are carefully matched with the correct categories.

Feature selection mechanism
In NLP, feature selection involves the selection of a subset of the terms occurring in the training data. In the question classification system, feature selection primarily serves two purposes: eliminating noisy words to increase classification accuracy and decreasing the training data size for efficiency. One feature mechanism utilised in the study is the N-gram model. An N-gram is a contiguous sequence of N tokens in a textual document. The N-gram feature technique is useful for predicting the next probable word in a sequence. The research utilised the unigram (N = 1), the bigram (N = 2) and the trigram (N = 3) as inputs during training. Another aspect of the feature selection mechanism is the weight of a word in the question. After tokenisation, the Bag-of-Words (BOW) feature extraction method for word counting is implemented using the TF, TP and TF-IDF weighting schemes. With TF-based weighting, the number of times a word occurs in the question is computed. In contrast, the TF-IDF scheme gives greater weight to words that are rare across the collection of questions, whereas TP only records whether a word is present in the question.
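The N-gram extraction and the three weighting schemes described above can be illustrated with a short Python sketch. It is a simplified illustration rather than the Weka implementation used in the study: the function names are hypothetical, and the TF-IDF variant shown (raw count multiplied by log(N/df)) is one common formulation chosen here as an assumption.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-token sequences of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_presence(terms):
    """TP: 1 for every term that occurs at least once in the question."""
    return {t: 1 for t in set(terms)}


def term_frequency(terms):
    """TF: raw count of each term in the question."""
    return dict(Counter(terms))


def tf_idf(docs):
    """TF-IDF per document with idf = log(N / df); an assumed variant, others exist."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    return [
        {t: count * math.log(n_docs / df[t]) for t, count in Counter(doc).items()}
        for doc in docs
    ]
```

A term that appears in every question receives a TF-IDF weight of zero under this variant, which is how the scheme favours rarer terms.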

Classifier performance evaluation metrics
Aside from classification accuracy, the study examined further metrics, including precision, recall, F-measure, and the area under the Receiver Operating Characteristic curve (ROC-AUC), in comparing the performance of the algorithms. As shown in Eq. 1, classification accuracy measures the proportion of accurate predictions to all predictions. Precision is the ratio of true positives to the true positives plus the false positives. The recall metric determines how many of the relevant positives are retrieved. As shown in Eq. 3, the recall is the ratio of true positives to the true positives plus the false negatives.
The F-measure is a comparative metric for the performance of classifiers. As depicted in Eq. 4, the F-measure is the harmonic mean of precision and recall.
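For reference, the standard textbook definitions of these metrics can be written as follows. The numbering is assumed to match the equations cited in the text, with Eq. 2 taken to be precision. Here TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives; TP in this block is unrelated to the Term Presence weighting scheme.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1}\\
\text{Precision} &= \frac{TP}{TP + FP} \tag{2}\\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{3}\\
F\text{-measure} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}
\end{align}
```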
The ROC curve is a probability-based performance measure that characterises classification performance at different thresholds. The ROC curve is a graph that plots the True Positive Rate (TPR) against the False Positive Rate (FPR). The Area Under the Curve (AUC) determines the degree of separability between classes. The greater the AUC value, the better the classifier can differentiate between the specified classes.
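One way to make the AUC concrete is through its rank interpretation: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The Python sketch below computes it that way for a single class treated one-vs-rest. The function name and the pairwise loop are illustrative assumptions, and how the study's tooling aggregates AUC over the seven classes is not shown here.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    labels: 1 for the positive class, 0 otherwise.
    scores: classifier confidence for the positive class.
    Ties count as half a win. The pairwise loop is O(P * N), fine for small data.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 1.0 means every positive is ranked above every negative, while 0.5 corresponds to a ranking no better than chance.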

Implemented machine learning algorithms
The algorithms utilised in this work fall into two primary categories: conventional supervised machine learning algorithms and multi-classifier ensemble algorithms.

Supervised learning algorithms
Supervised learning methods infer prediction based on a labelled dataset. The classifier generated from classification algorithms has input instances and matching class outputs. The primary objective of a supervised algorithm is to approximate a mapping function during training for future prediction [30]. The Random Forest (RF) algorithm constructs many decision trees and combines them to get a more precise and steady prediction. While splitting a node, the RF algorithm analyses a random subset of features for the most significant feature and generates a more accurate prediction model [26].
The Decision Tree (DT) algorithm is a tree-structured classifier in which branches represent decision rules, internal nodes represent dataset features, and leaf nodes represent class labels. In a DT algorithm, the decision nodes lead to multiple branches, while the leaf nodes are the outputs that cannot have branches [29].
The Support Vector Machine (SVM) plots each data item as a point in an n-dimensional space, where n represents the number of attributes. Each coordinate of a point corresponds to the value of one attribute, and a hyper-plane is fitted to differentiate the class labels. The hyper-plane segregates the classes, and observations are assigned according to the side of the hyper-plane on which they fall [26].
The K-nearest neighbor (KNN) algorithm is a non-parametric classifier that uses proximity to classify and cluster data points. The KNN uses similarity assumption to categorise similar data variables. The KNN is often referred to as a lazy learner algorithm because it only learns from the training data during classification [7].
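The proximity-based voting behind KNN can be sketched as below. This is a minimal illustration that assumes questions have already been converted to equal-length numeric feature vectors; the Euclidean distance, the default k and the function name are assumptions and do not reflect the settings used in the study.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training vectors.

    Vectors are equal-length sequences of numbers; distance is Euclidean.
    No model is built beforehand: all work happens at prediction time,
    which is why KNN is called a lazy learner.
    """
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```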

Ensemble algorithms
Ensemble learning, primarily a multi-classifier system, combines several machine learning models in the prediction process [37]. The ensemble method uses a base-learner and an inducer algorithm to train on the dataset. This meta approach of utilising a base-learner and an inducer algorithm seeks to produce an optimal predictive classifier [39]. The ensemble is referred to as homogeneous when a single base learning algorithm is employed and as heterogeneous when multiple learning algorithms are used [39].
Bagging is a parallel ensemble method that combines bootstrapping and aggregation. Bootstrapping randomly samples training data instances with replacement, which helps reduce variance and prevent overfitting [1].
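The two halves of bagging, bootstrap sampling and vote aggregation, can be sketched as follows. The sketch is illustrative only: the function names are hypothetical, and `train_majority_stub` is a deliberately trivial placeholder base learner (an assumption) standing in for RF, SVM or any other classifier.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def bagging_fit(data, train_fn, n_models=10, seed=0):
    """Train n_models base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_fn(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Aggregate by majority vote over the base learners' predictions."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def train_majority_stub(sample):
    """Placeholder base learner: always predicts the sample's most common label."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority
```

Because each base learner sees a slightly different resample of the data, their individual errors tend to differ, and the vote averages part of that variation away.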
AdaBoost is a boosting-based classifier that decreases bias and variance by combining weak learners into a strong learner. Before classification, the boosting classifier assigns equal weight to all data points. After each round of classification, the weight of incorrectly classified instances is increased while the weight of correctly classified instances is decreased [44].
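The re-weighting step can be made concrete with a single boosting round. The sketch below follows the AdaBoost.M1 formulation, in which correctly classified instances are scaled by beta = error / (1 - error) and all weights are then renormalised; whether this matches every detail of the implementation used in the study is an assumption, and the function name is illustrative.

```python
def adaboost_reweight(weights, correct):
    """Apply one AdaBoost.M1-style weight update.

    weights: current instance weights, summing to 1.
    correct: booleans, True where the weak learner classified the instance correctly.
    Returns (new_weights, error). Weights are returned unchanged when the
    weighted error is 0 or at least 0.5, the usual stopping conditions.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error == 0 or error >= 0.5:
        return list(weights), error
    beta = error / (1 - error)
    # Shrink the weights of correct instances, then renormalise so they sum to 1.
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled], error
```

With four equally weighted instances and one misclassification, the misclassified instance ends the round holding half of the total weight, so the next weak learner is pushed to get it right.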
The Random Subspace algorithm combines predictions from base classifiers, typically decision trees, each trained on a distinct random subset of the features. In a Random Subspace classifier, diversity is introduced by randomly varying the columns used to train each member of the ensemble [44].
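The column sampling that gives Random Subspace its diversity can be sketched as below. The helper names, the subspace size and the seed are illustrative assumptions; training the base classifiers on the projected data is omitted.

```python
import random


def random_subspaces(n_features, subspace_size, n_models, seed=0):
    """Pick a random subset of feature (column) indices for each ensemble member."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(n_features), subspace_size))
        for _ in range(n_models)
    ]


def project(row, columns):
    """Keep only the selected columns of one feature vector."""
    return [row[c] for c in columns]
```

Each ensemble member would then be trained on `project(row, columns)` for every training row, using its own `columns` list, and predictions would be combined by voting.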

Results and analysis
This section presents experimentation and training in Weka using supervised learning and ensemble classifiers. The tenfold and fivefold cross-validation techniques are implemented to resample and evaluate the algorithms through training, testing, and validation.

Tenfold cross-validation experiments
The tenfold cross-validation utilised a 90-10% training and testing ratio iteratively on the dataset. As shown in Table 2, sixteen machine learning algorithms were compared for accuracy using the Unigram, Bigram, and Trigram N-gram sequence under three text weighting schemes TP, TF and TF-IDF.
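The iterative 90-10% split can be illustrated with a small index generator. This is a simplified sketch: the function name is hypothetical, the split is a plain shuffled partition, and any stratification by class that the actual tooling may apply is not reproduced.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With the 1096 questions of this dataset and k = 10, each test fold holds 109 or 110 questions and the remaining questions form the training set; with k = 5 the same function gives an approximately 80-20% split.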
The results from classification show a dominant performance for AdaBoost (SVM) ensemble machine learning algorithm under all three weighting schemes using Unigram. The conventional SVM still performed with the highest accuracy under Unigram compared to the other traditional machine learning algorithms. The classification accuracy decreased to low levels under Bigram and Trigram feature selection methods. The AdaBoost (KNN) was the worst-performing classifier in terms of accuracy using the tenfold cross-validation technique, as illustrated in Table 2.
The F-measure score, the harmonic mean of precision and recall, is depicted in Table 3. The AdaBoost (SVM) ensemble algorithm under Unigram for the three weighting schemes has the highest F1 score of 0.782, indicating a good classification compromise between precision and recall. The traditional SVM algorithm under Unigram performed as the second best classifier with an F1 score of 0.781.
The ROC value indicates a trade-off between specificity and sensitivity. The resulting ROC value shows the classifier's ability to differentiate between the seven classes from the confusion matrix correctly. As illustrated in Table 4, Bagging (RF) under Unigram (TP) has the highest ROC value of 0.955. For Bigram and Trigram, the traditional RF algorithm and the RandomSubSpace (RF) performed equally well.

Research Discover Artificial Intelligence (2023) 3:29 | https://doi.org/10.1007/s44163-023-00073-5

Fivefold cross-validation experiments
The same experiments were repeated using the fivefold cross-validation technique to compare and ascertain classifier performance. As depicted in Table 5, ensemble AdaBoost (SVM) still performed with the highest accuracy under Unigram (TP, TF, TF-IDF). The conventional SVM algorithm follows as the second-best classifier in terms of accuracy under the same weighting schemes in Unigram. The AdaBoost (KNN) is the worst-performing classifier under the Trigram feature selection mechanism. The performance of the fivefold cross-validation approach is consistent with that of the tenfold cross-validation technique for the best and worst performing algorithms.
As shown in Table 6, the AdaBoost (SVM) ensemble machine learning algorithm under Unigram (TP, TF, TF-IDF) performed with the highest F1 score of 0.769, while the conventional SVM performed with a score of 0.766.
As depicted in Table 7, the Bagging (RF) ensemble machine learning algorithm under Unigram (TF, TF-IDF) has the highest ROC value of 0.951.

Comparative analysis, accuracy
The classification results for the machine learning metrics (accuracy, F-score and ROC value) show a compelling trend between the conventional and the ensemble machine learning algorithms under the tenfold and the fivefold cross-validation techniques. This aspect of the analysis focuses on the best feature selection mechanism, Unigram, with the respective weighting schemes (TP, TF, TF-IDF) under the cross-validation techniques.

RQ1:
Which machine learning algorithm has the highest accuracy in developing an automatic instructor-assisted question classification system?

Comparative accuracy for the conventional algorithms
As depicted in Fig. 3a and b, among the conventional machine learning algorithms, the SVM from the tenfold cross-validation has the highest accuracy of 78.46%, compared to 76.91% for the fivefold SVM.

Comparative accuracy for the ensemble algorithms
As depicted in Fig. 4a and b, among the ensemble machine learning algorithms, AdaBoost (SVM) from the tenfold cross-validation has the highest accuracy of 78.55%, compared to 77.18% for the fivefold AdaBoost (SVM).

Summary of accuracy
In response to RQ1, as shown in Fig. 5, the ensemble AdaBoost (SVM) has the highest accuracy of 78.55%, compared to 78.46% for the conventional SVM, using the tenfold cross-validation technique.

Comparative analysis, F-measure
The F-measure is the harmonic mean of precision and recall. The F-score metric is essential here since it summarises the balance between precision and recall in the text classification results.
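For a multi-class problem such as the seven question categories, precision, recall and F1 are typically computed per class (one-vs-rest) and then averaged. The Python sketch below shows this under stated assumptions: the function names are illustrative, and the support-weighted averaging is assumed to resemble the summary value reported by the tooling rather than confirmed by the study.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed one-vs-rest."""
    labels = sorted(set(y_true) | set(y_pred))
    results = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results[label] = (precision, recall, f1)
    return results


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)
    scores = per_class_prf(y_true, y_pred)
    return sum(scores[label][2] * count for label, count in support.items()) / len(y_true)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a class with high precision but low recall (or the reverse) still receives a modest F1.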

Comparative F1-score for the conventional algorithms
The F-measure score for the conventional machine learning algorithms, as depicted in Fig. 6a and b, shows that the SVM from the tenfold cross-validation has the highest value of 0.781. Under the fivefold cross-validation, the SVM also has the highest F-measure, with a score of 0.766.

Comparative F1-score for the ensemble algorithms
In Fig. 7a and b, the AdaBoost (SVM) using the tenfold cross-validation method has the highest F1-score of 0.782 compared with 0.769 for AdaBoost (SVM) using the fivefold.

Summary of F-score
In response to RQ2, the F-measure score of AdaBoost (SVM) for the tenfold cross-validation method outperformed the traditional SVM with a 0.001 margin. This is depicted in Fig. 8.

Comparative analysis, ROC Value
The ROC curve shows how much the model can distinguish between the labelled classes. The ROC curve plots the TPR against the FPR with an AUC measure of the two-dimensional area.

Comparative ROC value for the conventional algorithms
In Fig. 9a and b, the RF algorithm has the highest ROC value of 0.951 for Unigram (TF) using the tenfold cross-validation technique, while the SVM is second with a value of 0.928 across all the weighting schemes of Unigram.

Comparative ROC value for the ensemble algorithms
In Fig. 10a and b, the Bagging (RF) for Unigram (TP) has the highest ROC value of 0.955 using the tenfold cross-validation. The ensembles built on SVM did not perform as well: the highest value for an SVM-based ensemble, 0.944, was obtained by RandomSubSpace using the Unigram (TP).

Summary of ROC value
In response to RQ2, as shown in Fig. 11, the ROC value of the Bagging (RF) ensemble algorithm using the tenfold cross-validation outperforms the conventional RF algorithm.

Discussion and findings
The results demonstrate that the tenfold cross-validation outperforms the fivefold cross-validation method for all the classification metrics: accuracy, F1-score and ROC value. The Unigram, Bigram and Trigram feature selection methods implemented across the three weighting schemes (TP, TF, TF-IDF) still show better performance for the tenfold cross-validation technique. Secondly, the ensemble machine learning algorithms outperform the conventional algorithms when implemented with Unigram and Bigram feature selection. The Trigram, however, shows a better performance for the conventional machine learning algorithms across the classification metrics when compared to the ensemble methods. The ensemble algorithms' primary drawback is the length of the training time: the ensemble classifiers took a significant time to train on the dataset during simulation. The accuracy for both conventional and ensemble algorithms decreased as the N-gram size increased from Unigram to Trigram. The weighting schemes also did not significantly affect accuracy, even though they lengthened the classifiers' training time. The F1-score across the tenfold and the fivefold [...]
The simulation results indicate that the AdaBoost (SVM) ensemble algorithm has the highest accuracy of 78.55% using the tenfold cross-validation technique in Unigram (TP, TF, TF-IDF). In addition, AdaBoost (SVM) has the highest F1-score of 0.782. The ROC value, however, shows a higher score of 0.955 for the Bagging (RF) ensemble algorithm using the tenfold cross-validation technique under Unigram (TP), while AdaBoost (SVM) has a ROC value of 0.903.
The study by Onan [25] compares ensemble and traditional machine learning algorithms for Unigram, Bigram and Trigram under the weighting schemes TP, TF, and TF-IDF. Even though the study was limited to the tenfold cross-validation method, the results indicate a superior performance for the ensemble algorithms over conventional algorithms. In addition, Onan's [25] simulation shows a slightly decreased accuracy and F1-score from Unigram to Trigram across the traditional and the ensemble machine learning methods. The findings of Onan align with our study, where ensemble methods dominated conventional algorithms for accuracy, F1-score and ROC value. Even though Hassan et al. [12], while comparing conventional KNN, RF, and SVM, show a dominant performance for KNN, our study shows KNN as the worst performing algorithm.

Conclusion and future work
In this study, we demonstrated the dominance of the tenfold cross-validation technique over the fivefold method and examined the impact of the weighting schemes for question classification. Although the study utilised data from Ghana, results show a trend where ensemble algorithms outperformed conventional ones. The comparative discussion in the study is narrow since the data sources differ, and most of the reviewed literature focused on the tenfold cross-validation method without the weighting schemes and the N-gram features. Comparing ensemble methods and deep learning algorithms is left for future studies.