1 Introduction

Course feedback from students has long been used by individual instructors, groups, and institutions for a variety of purposes. Instructors can use course feedback to find out what matters to students and how effective their teaching material and methods are, in order to improve the course for a future offering. Institutions can use student surveys to gauge student perceptions and opinions, and even to evaluate instructors. Quantitative data from surveys, such as student ratings of their course instructors, have long been used in an effort to measure teaching effectiveness: “student ratings are the single most valid source of data on teaching effectiveness - in fact there is little support for the validity of any other source of data” Spencer and Schmelkin (2002).

Besides the quantitative data, there are also qualitative data in the form of textual responses, which are not as easy to explore, summarize, or visualize. Student text comments come with many issues, such as misspellings, abbreviations, short or irrelevant statements, and rambling. However, because of the open-ended nature of the feedback, students can describe what is on their mind and what they feel is important without being constrained by ratings or quantitative constructs such as a Likert scale. At the same time, the difficulty of understanding human language at scale has kept many from fully utilizing this resource.

In the past decade or so, there has been an increasing amount of research using text-based student course feedback for a variety of tasks and purposes. With ubiquitous web-based platforms and social media, there is also an abundance of data to collect and analyze, for example, from Twitter Chen et al. (2014) or the ratemyprofessor website Onan (2020). Many efforts focus on sentiment analysis, the field of study that analyzes people’s opinions, sentiments, attitudes, and emotions in text. There has been substantial research using sentiment analysis on educational data; for example, see the surveys in Dolianiti et al. (2018); Zhou and Ye (2020). Just as sentiment analysis has been used by businesses to improve marketing and customer relationships, in the educational field it may be used to improve the “attractiveness of higher educational institutions” Santos et al. (2018) or to decrease drop-out rates in Massive Open Online Courses (MOOCs) Kastrati et al. (2021).

Besides sentiment analysis, other tasks have been explored in related research, for example topic modeling or topic-based classification: the goal here is not to extract the sentiment (positive or negative), but to extract or predict the topics the comments address, for example Van et al. (2018); Srinivas and Rajendran (2019). There is also work on aspect-based sentiment mining, which targets the sentiment for each specific entity (aspect) in the comments, for example Sindhu et al. (2019); Ren et al. (2022).

Much of the earlier work related to course feedback analysis has concentrated on traditional Machine Learning (ML) techniques, for example Altrabsheh et al. (2014); Koufakou et al. (2016). In the last few years, researchers have taken advantage of the advances in ML and Natural Language Processing (NLP) and used Deep Learning (DL) models, for example utilizing word embeddings, and convolutional or other deep neural networks; as examples, see Dessì et al. (2019); Onan (2020); Estrada et al. (2020). There are also recent educational data mining surveys Dutt et al. (2017); Kastrati et al. (2021).

In this paper, we first describe the process of collecting and annotating a corpus of more than ten thousand online reviews for a variety of courses with topics ranging from Web Development to Data Science to Marketing. We applied sentiment polarity extraction on our corpus, labeling reviews as positive or negative. We also explored topic-based classification, categorizing each review into one of four topics. We performed an extensive comparative analysis of several DL techniques (such as CNNs and LSTMs using word embeddings, but also state-of-the-art models: BERT, RoBERTa, and XLNet) and compared their efficacy with traditional classifiers (k-Nearest Neighbor, Naïve Bayes, and Support Vector Machines (SVMs)). Besides our extensive experimentation with very different classifiers, we also explored further possible improvements in accuracy and their effects on the runtime efficiency of the DL models. The main contributions of our work are:

1. We utilized a brand new corpus we collected from over ten thousand course reviews posted online. Using the new corpus, we presented a two-fold analysis (opinion-based and topic-based), as opposed to many previous works that focused on only one task. This way, our work can demonstrate how to employ the data for different tasks and highlight similarities and differences between the two tasks. For example, in our experiments for topic-based classification, we found that an SVM model performed better than the DL models, which was not the case for the opinion mining experiments.

2. We utilized and experimented with not only traditionally used Deep Learning models, such as CNNs and RNNs, but also state-of-the-art NLP transformer-based models, namely BERT, RoBERTa, and XLNet. BERT (and, by extension, similar models) has quickly become the de facto baseline in NLP experiments Rogers et al. (2020). Our literature review in the area of course feedback analysis found only a handful of papers that used BERT and none that used RoBERTa or XLNet (see Section 2).

3. We reported the performance results and observations from extensive experiments with the diverse models (traditional, deep learning, and transformer-based) we employed for classification. Our experimentation is rigorous and built on a solid framework: for example, we used cross-validation, reported several metrics, and compared confusion matrices. Additionally, we explored how to improve the accuracy of our DL models while examining the trade-off between the improved accuracy and runtime. We have not seen a similar exploration in related work on course feedback analysis.

The organization of this paper is as follows. In Section 2, we review previous work related to ours. In Section 3, we describe the corpus we developed and used in this study. In Sections 4 and 5, we provide a detailed view of our models followed by our experiments and results. Finally, in Section 6, we summarize our work and provide concluding remarks.

2 Related work

There has been an ever increasing amount of research using text mining and NLP for educational purposes, due to increased processing power, the abundance of data, and recent advances in ML and NLP models. In the following, we review traditional techniques applied to the analysis of student course reviews, followed by current work in this area, namely using DL.

We also provide a summary of representative related work in Table 1: the table lists the ML techniques and the types of data used in each reference, listed by first author and year of publication to save space.

2.1 Traditional techniques

Table 1 Summary of Representative Related Works. References listed as first author, year, in chronological order

In the previous decade, several articles in the literature used traditional techniques and tools to mine student feedback. For a thorough review of earlier work, see surveys such as Peña-Ayala (2014). In the following, we give an overview of representative works in this area based on traditional techniques.

Sliusarenko et al. (2013) used key-phrase extraction and factor analysis to identify which factors were important in student comments; they also employed regression analysis to find which factors have the most impact on student ratings. Altrabsheh et al. (2014) applied several pre-processing techniques to data they collected and then machine learning algorithms for sentiment analysis, finding that the best method was the Linear SVM using unigrams. Ortigosa et al. (2014) performed sentiment analysis by combining lexicon-based techniques with ML models such as Decision Tree, Naïve Bayes and SVM. Tian et al. (2014) recognized emotions (such as anger, anxiety, or joy) in Chinese texts from e-learners and proposed a framework for regulating the e-learner’s emotion based on active listening strategies. Koufakou et al. (2016) explored sentiment analysis using traditional techniques based on bag-of-words, Naïve Bayes and k-Nearest Neighbor, as well as Frequent Itemset Mining to identify key frequent terms in survey comments. As more recent examples, Lalata et al. (2019) employed an ensemble of traditional ML algorithms, specifically, Naïve Bayes, Logistic Regression, SVMs, Decision Tree and Random Forest. Hujala et al. (2020) used an LDA (Latent Dirichlet Allocation) method, and then applied qualitative and quantitative evaluation methods to validate the outcomes by connecting them to theoretical frameworks and quantitative data.

As to the type of data used, researchers have collected data from, e.g., social media, or used data from their own courses. As examples, researchers have collected real-time student feedback from lectures as well as end-of-semester surveys Altrabsheh et al. (2014); survey data from courses at their department Koufakou et al. (2016); students’ conversations on Twitter to understand students’ opinions about the learning process Chen et al. (2014); and Facebook data, for example messages on the user wall and basic user data such as gender and birthday Ortigosa et al. (2014). More recent examples: Hujala et al. (2020) used over six thousand responses from student surveys carried out at a Finnish university; Abbas et al. (2022) used student evaluations of more than five thousand teachers at a university in Mexico.

Researchers have also shown how to use the extracted sentiment to predict student performance. For example, Sorour et al. (2015) applied probabilistic latent semantic analysis (PLSA) on student comments collected after specific lessons in introductory programming courses. Then, they predicted student final grades using SVM and artificial neural networks (ANN), where the SVM had the highest accuracy.

Besides applying ML techniques to student feedback, there has also been work on developing tools and frameworks for the analysis of student feedback. For example, a conceptual framework for student feedback analysis by Gottipati et al. (2017) included a sentiment extraction stage and logistic regression. Grönberg et al. (2021) proposed an open-source online text mining tool for analyzing and visualizing student feedback entered in course surveys at a university. Estrada et al. (2020) proposed emotion recognition and opinion capture as part of an integrated learning environment for Java. Srinivas and Rajendran (2019) used LDA for topic modeling and the VADER tool (proposed by Hutto and Gilbert (2014)) for sentiment classification, as part of a larger systematic view of strengths, weaknesses, etc., to be used by universities to analyze online feedback.

2.2 Deep Learning (DL)

DL is a more recent advancement in the larger field of machine learning and it has already been used for educational data mining successfully Doleck et al. (2020). DL models are usually Artificial Neural Networks (ANNs) with more layers than traditional ANNs. They include networks such as Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), etc. Among the more recent neural architectures, BERT Devlin et al. (2019) is considered state-of-the-art in several NLP tasks. We review the basics of the models we use in our work in Section 4.

For a thorough view of recent work related to the analysis of student feedback, the reader is referred to recent surveys, such as Dutt et al. (2017); Dolianiti et al. (2018); Kastrati et al. (2021). The most recent survey we found, Kastrati et al. (2021), identified only seven papers on this topic that utilized DL methods, although we found additional works using DL models dated after the survey. In the following paragraphs, we review representative work that is related to ours.

(Yu et al., 2018) used sentiment information extracted from student self-evaluations to improve the accuracy of early prediction of which students are likely to fail in a Chinese course. They used a Chinese affective lexicon, and structured data such as attendance, in conjunction with unstructured text comments. They found that CNNs using both structured and unstructured data had the best performance overall. Another study by (Tseng et al., 2018) also focused on course surveys with the task to use student comments for evaluating and hiring teaching faculty. They compared deep networks such as Recurrent Neural Networks (RNNs) using a Chinese text sentiment analysis kit, named SnowNLP, and they found that the best accuracy was achieved by an attention LSTM classifier.

There is a branch of related work based on Vietnamese data. Van et al. (2018) developed a Vietnamese Students’ Feedback Corpus named UIT-VSFC, human-annotated for classification based on sentiment and on topics. Nguyen et al. (2018) explored variants of LSTMs for sentiment analysis on that corpus. Truong et al. (2020) utilized PhoBERT, a pre-trained BERT model for Vietnamese, and fine-tuned it to achieve state-of-the-art results on UIT-VSFC.

Dessì et al. (2019) experimented with several word embedding representations and DL as well as traditional models, for regression based on a sentiment score rating. They found that the best performance was achieved by their Bidirectional LSTM with an attention layer, based on word2vec. They also explored training word embeddings on a relevant corpus (which, as far as we can tell, is not publicly available).

Estrada et al. (2020) presented sentiment analysis and emotion detection on online data from various sources, such as YouTube or Twitter, as well as data collected from their own courses. They utilized different models such as CNN and LSTM, as well as BERT and an evolutionary model; the latter performed the best in their experiments. Their DL models used one-hot encodings of the student comment text as input, not word embeddings. Onan (2020) also focused on sentiment analysis and experimented with various embeddings (word2vec, GloVe, fastText, LDA2Vec) and models such as CNN, RNN, and LSTM, as well as ensemble techniques. Their experimentation is very thorough and uses more than 150 thousand reviews collected from the ratemyprofessor website.

There is also work focused on aspect-based sentiment analysis, which targets sentiment related to a specific aspect, for example, the instructor or the course. Sindhu et al. (2019) applied a two-layer LSTM for aspect-based sentiment analysis on their own university data as well as SemEval-2014 data. The first layer predicted aspects from the feedback, while the second predicted the sentiment polarity. Kastrati et al. (2020) used more than 100 thousand reviews from Coursera as well as classroom feedback. They applied LSTMs and CNNs using various word embeddings. Ren et al. (2022) used Chinese open-ended comments written by junior school students. They constructed dictionaries for topics and sentiments, which their deep learning model used to predict sentiments.

From our review, research in the analysis of student feedback has not fully embraced the state-of-the-art models, namely BERT and its extensions such as RoBERTa or XLNet. To start with, we already cited Estrada et al. (2020); Truong et al. (2020) earlier in this section. Rybinski and Kopciuszewska (2020) compared BERT models on 1.6 million student evaluations from the US and the UK, extracted from different sources. Wang et al. (2020) used the subtitles (captions) of videos from more than a thousand courses to predict instructor performance in online education, using a hierarchical BERT model based on the teacher’s verbal cues and on course structure.

In summary, even though there has been a considerable amount of research in student feedback analysis, there is still a gap in utilizing recent state-of-the-art models such as BERT, which our paper aims to fill. In our review, we noted that most work focuses on sentiment analysis, while our work also examines topic classification. We also noted that several works did not report metrics other than accuracy (for example, Estrada et al. (2020)) or did not describe their DL models or (hyper)parameters (for example, Tseng et al. (2018)). We presented extensive experimentation for two different classification tasks with various DL models, explored the effect of hyperparameters, and discussed runtime efficiency as well as classification accuracy.

3 Dataset description

For this work, we collected publicly available course reviews posted online for bootcamp-type courses at the website https://www.coursereport.com. The reviewed courses covered various topics, from assembly language, to web development, to marketing, and the courses were offered online or in different cities globally. The vast majority of the reviews were in English. The data contained a course title, a course review (text comments), a review rating (1 through 5), and other fields we did not use, for example, username, instructor rating, or helpfulness of the review. We used a web crawler we developed from scratch to collect the review data. We pre-processed the text reviews to clean up invalid text: we removed remaining HTML tags and discarded any reviews shorter than 2 words. The resulting dataset had 10,610 reviews (text comments). The review text length ranges from a minimum of 2 words to a maximum of 4,219 words, with an average of 245 words and a standard deviation of 251.

First, we organized the data for the sentiment polarity extraction task, as follows. As mentioned above, each review had a star rating, ranging from 1 to 5. The dataset was divided into positive and negative reviews as follows: reviews with a rating of 4 or 5 were considered positive, while reviews with a rating of 1-3 were considered negative. The entire dataset ended up very imbalanced: it contained 91.5% positive and 8.5% negative reviews. Figure 1 shows the pre-processing and labeling steps. Table 2 shows examples of comments taken from positive and negative reviews.
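As a minimal sketch of this labeling step, assuming the cleaned reviews sit in a pandas DataFrame with hypothetical column names review_text and rating (the actual crawler output may be organized differently):

import pandas as pd

# Hypothetical file and column names; the crawler output may differ.
df = pd.read_csv("course_reviews_clean.csv")

# Ratings of 4 or 5 are labeled positive (1), ratings of 1-3 negative (0).
df["sentiment"] = (df["rating"] >= 4).astype(int)

# Roughly 91.5% positive and 8.5% negative in our data.
print(df["sentiment"].value_counts(normalize=True))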

Fig. 1 The overall pre-processing and labeling tasks for the sentiment analysis task

Table 2 Example comments from course reviews in our data and their label
Table 3 Top ranking features (trigrams) for positive and for negative course reviews

Finally, we examined the top ranking trigrams for the positive versus the negative reviews, shown in Table 3 (we also looked at unigrams and bigrams, but they were not as descriptive of the polarity between the reviews as the trigrams). In Table 3, one can see that “web development course” is a top ranking term for both types of reviews. We observed that this is the most frequent course topic overall, and therefore it appears very frequently in both negative and positive reviews. Other terms, such as “waste time money”, “dont waste time”, and “free online resources”, rank high in negative reviews, while positive reviews have terms such as “highly recommend course” and “life changing experience”.
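A sketch of how such a ranking can be produced with scikit-learn; the exact ranking criterion used for Table 3 is not specified here, so raw trigram frequency is used as an assumption:

from sklearn.feature_extraction.text import CountVectorizer

def top_trigrams(texts, n=10):
    # Count trigrams over the given reviews and return the n most frequent ones.
    vec = CountVectorizer(ngram_range=(3, 3), stop_words="english")
    counts = vec.fit_transform(texts)          # reviews x trigrams (sparse)
    totals = counts.sum(axis=0).A1             # total count per trigram
    vocab = vec.get_feature_names_out()
    return sorted(zip(vocab, totals), key=lambda t: -t[1])[:n]

# Applied separately to the two polarities, e.g.:
# top_trigrams(positive_reviews); top_trigrams(negative_reviews)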

For the topic-based classification, we first looked at the course titles and their reviews and conducted different visualizations; for example, see the word cloud in Fig. 2. Additionally, we utilized Latent Dirichlet Allocation (LDA) Blei et al. (2003) and identified that the major course topics were, by far, Web Development, Programming, and Data Science. We also manually filtered reviews based on similar course titles and then grouped these courses into one topic or category. Finally, we dropped the remaining reviews that did not have any course name or identifiable topic, and ended up with 7,503 reviews. The topics and their distribution in the resulting data are shown in Table 4. As shown in Table 4, the large majority of the courses are related to programming or web development.
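A sketch of this LDA step with scikit-learn; the number of topics and the vectorizer settings shown here are assumptions, not the values used in the study:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_top_words(texts, n_topics=5, n_top_words=10):
    # Fit LDA on a document-term matrix and print the top words per topic.
    vec = CountVectorizer(stop_words="english", min_df=5)
    dtm = vec.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(dtm)
    words = vec.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [words[i] for i in weights.argsort()[::-1][:n_top_words]]
        print(f"Topic {k}: {', '.join(top)}")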

4 Methodology

The general process for classification using an ML algorithm in our work is shown in Fig. 3. The data is split into a training and a test set, each including the text and the labels (for example, positive or negative). The training data is used to extract the vocabulary: the set of unique tokens or words found in the training data. Based on the vocabulary, we then extract the features used to train the model (more details on the features are given in the following sections). In the prediction phase, the model generated from training is used to predict the labels for the test inputs. For the split of the dataset into training and test sets, we used cross-validation (see Section 5.1). In k-fold cross-validation, the data is split into k subsets, and the experiment is executed k times, so that a different subset serves as the test set in each execution.
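A sketch of this training/prediction loop with stratified k-fold cross-validation in scikit-learn; build_model is a hypothetical factory that returns a fresh model so the vocabulary and features are fit on the training fold only:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def cross_validate(texts, labels, build_model, k=10):
    texts, labels = np.array(texts, dtype=object), np.array(labels)
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in skf.split(texts, labels):
        model = build_model()                            # fresh, untrained model
        model.fit(texts[train_idx], labels[train_idx])   # features fit on training fold only
        preds = model.predict(texts[test_idx])
        scores.append(f1_score(labels[test_idx], preds, average="macro"))
    return float(np.mean(scores))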

In the rest of this section, we first present an overview of a traditional approach for classification of text, based on Bag-of-Words (BoW). We also briefly present the BoW classifiers we use in our experiments. Then, we give an overview of DL models for text classification in general, as well as the specific models we use in our work.

Fig. 2 Word cloud of course titles

Table 4 Topics, example courses in each topic, distribution (percentage) of each topic, and total number of comments in the Topics Dataset

4.1 Traditional Bag-Of-Words (BoW) approach

First, we extracted the text from our collection of course reviews (corpus) and then tokenized the text into words (see Section 3 for more details on pre-processing). The resulting dataset was represented as a BoW matrix. In these methods, the features extracted from the data in Fig. 3 are the BoW matrix.

BoW methods do not preserve the order of the words in the text or the context of a word in a phrase, nor do they preserve or extract grammar-related or other relations: they only store frequency information for each unique word in the corpus. The dimensionality of the resulting BoW matrix is the number of documents (or unique course reviews) \(\times\) the unique words or tokens in our corpus (the vocabulary). A small detail is that the vocabulary is extracted from the training set only.

For this part of our work, we used TF-IDF (Term Frequency-Inverse Document Frequency) values. TF-IDF is a statistical measure used to evaluate the importance of a word in a document within a corpus. Using TF-IDF, the importance increases proportionally to the frequency of the word in the document, but it is offset by the frequency of the word in the corpus. We used the TF-IDF features as input to three classification models, Naïve Bayes, k-Nearest Neighbor (k-NN), and Support Vector Machines (SVM), with the goal of either predicting whether a course review comment is negative or positive (sentiment analysis) or detecting one of the four topics (topic-based classification). These algorithms are briefly described below; for more details, see any related text, such as Tan et al. (2005).

Fig. 3 The training and prediction process for classification using a machine learning algorithm

Naïve Bayes offers a probabilistic framework for solving classification problems. Naïve Bayes first uses the training data (corpus) to find the probability of each unique word as it occurs in the corpus for each class. For a test document, Naïve Bayes multiplies the pre-calculated probabilities of every word in the document and then chooses the class with the highest probability to classify the test record.

In the k-Nearest Neighbor algorithm, given a course review x and a user parameter k, the algorithm finds the k reviews that are the most similar to x. These are called its k-nearest neighbors. Then, based on the majority of the labels of x’s k-nearest neighbors, the algorithm predicts the label of x.

SVMs have been very successfully applied to a number of applications since their inception, including the analysis of course feedback (see Section 2). The SVM algorithm finds a hyperplane (decision boundary) that separates the classes by their features over a space Cortes and Vapnik (1995). The goal is to maximize the margin, that is, the distance between the decision boundary and the closest training points of each class, in order to reduce the upper bound on the expected generalization error. For data that are not linearly separable, the solution is to map the inputs into a higher-dimensional feature space.
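A sketch of these three BoW classifiers as scikit-learn pipelines over TF-IDF unigram features with library defaults, as described in Section 5.1; the exact configuration used in our experiments may differ slightly:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# TF-IDF over unigrams with scikit-learn defaults; each factory can be passed
# to the cross-validation sketch shown earlier.
bow_models = {
    "naive_bayes": lambda: make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "knn":         lambda: make_pipeline(TfidfVectorizer(), KNeighborsClassifier()),
    "linear_svm":  lambda: make_pipeline(TfidfVectorizer(), LinearSVC()),
}
# e.g. cross_validate(texts, labels, bow_models["linear_svm"], k=10)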

4.2 Deep Learning (DL) approach

4.2.1 Word embeddings

As already mentioned in the previous section, traditional methods for representing words in matrix form, such as BoW and TF-IDF, do not take into account the position or the context of the word in the document. Recent approaches proposed word embeddings that represent the semantic meanings of words Mikolov et al. (2013). Words that are similar in meaning or in context are closer to each other in the vector space, while words that are different are farther apart. In this work, we experimented with Word2Vec Mikolov et al. (2013), which uses a feed-forward neural network to predict the neighboring words for a given word in order to create the word embeddings. The embedding for each word is essentially a one-dimensional vector of d values, where d is a user-entered parameter.
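A sketch of how pre-trained word2vec vectors (the 300-dimensional Google News vectors described in Section 5.1) can be loaded into an embedding matrix for a given vocabulary, with out-of-vocabulary words initialized randomly; the gensim download shown here is an assumption about tooling:

import numpy as np
import gensim.downloader as api

def build_embedding_matrix(vocab, d=300):
    # One row per vocabulary word; rows for unknown words stay randomly initialized.
    w2v = api.load("word2vec-google-news-300")
    matrix = np.random.uniform(-0.25, 0.25, size=(len(vocab), d))
    for i, word in enumerate(vocab):
        if word in w2v:
            matrix[i] = w2v[word]
    return matrix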

4.2.2 Deep neural networks

In contrast to traditional techniques in Section 4.1, more recent approaches use DL, such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), to learn text representations. In the following, we give a brief background review on the models we used in our work; the reader is referred to (Minaee et al., 2021) for a comprehensive review on DL-based Text Classification. We also list the models we used for this work, and provide a detailed step-by-step example for a convolutional model we employed for our work.

Originally invented for computer vision, CNN models have subsequently been shown to be effective for many NLP tasks Kim (2014). CNNs utilize layers with convolving filters. In text related tasks, the filters are trained to identify word combinations that are most pertinent to the classification task at hand. In most of the recent literature, the word embeddings from the document are fed into the NN as features. In addition, as character-based CNNs have been shown to work for text related tasks Zhang et al. (2015), we briefly experimented with a character-based CNN.

RNNs have also been shown to be effective in NLP tasks due to their architecture, which is specifically designed for sequential (time-series) data. In NLP tasks, RNNs aim to learn linguistic patterns from different sequences of words. Basic RNNs are unable to retain information and find relationships over a long sequence of words, so we used LSTMs: Long Short-Term Memory units (LSTM) Hochreiter and Schmidhuber (1997) use gating functions to selectively store or “forget” input information according to how relevant it is to the classification task. Finally, we also experimented with a Bidirectional LSTM model, which ensures that the network can account for the preceding as well as the following context when processing the sequence of words.

An example of a convolutional model we employed based on word embeddings is shown in Fig. 4. As before, the collection of documents or course reviews was tokenized into words, but now we padded or truncated each resulting review to a set length (number of words), given by a user-entered parameter called maxlen. We extracted the vocabulary from the resulting text, and then created word embeddings of length d. The output of the embedding layer was a three-dimensional matrix of dimensionality \(n \times \textit{maxlen} \times d\), where n is the number of input reviews and d is the embedding dimension.

The embeddings were fed into the convolutional layer. The output of the convolutional layer was fed into a dropout and a max-pooling layer. There might be more than one convolutional layer employed in this model; if so, the outputs were concatenated before the next layer. Finally, we used a fully-connected dense layer to output the prediction of the model (the predicted label). This layer used a sigmoid or a softmax activation, depending on the label being binary (sentiment) or categorical (topic), respectively. Our LSTM or bi-LSTM models follow a similar general idea.
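A sketch of this CNN architecture in Keras, using the filter windows and dropout reported in Section 5.1; the number of filters per window is an assumption:

from tensorflow.keras import layers, Model

def build_cnn(vocab_size, maxlen, d=300, num_classes=2, num_filters=100):
    inputs = layers.Input(shape=(maxlen,))
    x = layers.Embedding(vocab_size, d)(inputs)   # optionally initialized with word2vec
    branches = []
    for window in (3, 4, 5):                      # filter windows from Section 5.1
        c = layers.Conv1D(num_filters, window, activation="relu")(x)
        c = layers.Dropout(0.5)(c)
        c = layers.GlobalMaxPooling1D()(c)
        branches.append(c)
    x = layers.concatenate(branches)
    if num_classes == 2:                          # sigmoid for binary sentiment
        out = layers.Dense(1, activation="sigmoid")(x)
        loss = "binary_crossentropy"
    else:                                         # softmax for the four topics
        out = layers.Dense(num_classes, activation="softmax")(x)
        loss = "sparse_categorical_crossentropy"
    model = Model(inputs, out)
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    return model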

Fig. 4 A depiction of a CNN-based model we used for classification

4.2.3 Transformer-based models

While word embeddings take into consideration the semantic similarities of words in a corpus, they do not explore different meanings of words based on context. Therefore, in word embeddings such as word2vec Mikolov et al. (2013), each word in the vocabulary will have one single embedding. More recent techniques introduced contextualized embeddings: they encode a word and its context from the words before it, and after it, so it “will generate a different embedding vector for the word ‘bank’ in ‘bank account’ to that for ‘river bank’ ” Rybinski and Kopciuszewska (2020).

BERT (Bidirectional Encoder Representations from Transformers) Devlin et al. (2019) is considered state-of-the-art in several NLP tasks. For example, in a recent SemEval Task for detecting offensive language Zampieri et al. (2020), the vast majority of the top entries in the task used BERT-like systems. A recent survey found that “in a little over a year, BERT has become a ubiquitous baseline in NLP experiments” Rogers et al. (2020). A transformer combines a multi-head self-attention mechanism with an encoder-decoder. BERT utilized Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM, the model masks some of the words, and uses the rest of the words to predict the masked words. In NSP, given two sentences, BERT was trained to predict if the second sentence is likely to follow the first sentence. For more information on the internal architecture of BERT, the reader is referred to Devlin et al. (2019).

Utilizing BERT is somewhat similar to using models such as the CNN discussed in the previous section, with some significant differences. One of these differences is the BERT tokenizer. The BERT model needs inputs in the form of token identifiers, attention masks, and segment identifiers. BERT marks the end of each sentence with a special [SEP] token. BERT also inserts a [CLS] token (which stands for “classification”) at the start of each input sequence.

Besides these, the BERT-based model we employed is overall similar to the previous CNN model in Fig. 4. The BERT-based model also uses a maxlen for the input comments (see the definition of maxlen in the previous section for the CNN model, and Fig. 4). The results from the BERT tokenizer were passed onto the BERT layer, whose [CLS] output was fed into a dense layer that outputs the prediction of the model. Just as in the DL models from the previous section, this layer used a sigmoid or a softmax activation, for binary (sentiment) or categorical (topic) classification, respectively.
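A short illustration of the tokenizer inputs described above, using the HuggingFace tokenizer and the maxlen of 50 from Section 5.1 (the review text is a made-up example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
review = "The instructors were helpful and the projects were very practical."
encoded = tokenizer(review, max_length=50, padding="max_length",
                    truncation=True, return_tensors="pt")

# input_ids start with [CLS], end the sentence with [SEP], then padding;
# attention_mask marks which positions hold real tokens versus padding.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])[:10])
print(encoded["input_ids"].shape, encoded["attention_mask"].shape)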

An important benefit of using a BERT-based model over a CNN-based model such as the one in Fig. 4 is that the BERT model has already been pre-trained on very large corpora: in fact, many pre-trained models are available for direct use or can be fine-tuned for a specific classification task. Fine-tuning means further training the pre-trained BERT model using our data. Section 5.1 includes the details of our BERT model and its hyperparameters.

There have been numerous models extending BERT. In our experiments, we used RoBERTa and XLNet. As the main differences from BERT, RoBERTa (Robustly optimized BERT approach) Liu et al. (2019) removed NSP and replaced BERT’s static masking (in MLM) with dynamic masking. In summary, RoBERTa modified parts of BERT, was trained on more data, and has been shown to be more robust than BERT. XLNet Yang et al. (2019) is based on an auto-regressive model, which predicts future tokens based on past ones, and uses a Transformer-XL. XLNet also introduced permutation language modeling, where all tokens (not only masked tokens) are predicted, in random rather than sequential order.

5 Experiments and results

5.1 Experimental setup

As discussed in Section 3, the dataset we collected for the sentiment analysis is very imbalanced: it contains 91.5% positive and 8.5% negative reviews. Therefore, for our sentiment-based classification, we used stratified 10-fold cross validation (CV). For the topic-based classification, we used stratified 5-fold CV to better suit the four topics and their distribution (see Table 4).

In order to implement the diverse suite of classification models we utilized, we wrote our code using different tools and platforms, which resulted in quite different implementations. We conducted all experiments using Google Colaboratory. We used scikit-learn for all our BoW experiments and Keras for our implementations of the DL models. All approaches based on TF-IDF were run with the scikit-learn defaults and unigrams. For the NN experiments we used 5 epochs, a 0.01 learning rate, a batch size of 32, the Adam optimizer, and 0.5 dropout. For the convolutional layers, we used rectified linear units (ReLU) and filter windows of 3, 4, or 5. For LSTMs, we used 64 units. For character embeddings, we used an embedding dimension of 16.

For our experiments with word embeddings, we first used the publicly available word2vec vectors trained on 100 billion words from Google News. The vectors have a dimensionality of 300 and were trained using the continuous bag-of-words (CBOW) architecture Mikolov et al. (2013). Words not present in the set of pre-trained vectors were initialized randomly. In our results, the models that used these pre-trained vectors are denoted as “Pre-trained”. We also experimented with non pre-trained word vectors, i.e., vectors that were randomly initialized.

Finally, for the experiments with the transformer models, we used PyTorch and the corresponding models provided by HuggingFace. Specifically, we used the bert-base-uncased, roberta-base, and xlnet-base-cased models, all with the lower-case option. Based on performance in early experiments and following the recommendations of the original developers of BERT Devlin et al. (2019), our transformer models used a learning rate of \(2e^{-5}\), 3 epochs, a batch size of 32, and a maxlen of 50.
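A condensed sketch of this fine-tuning setup in PyTorch with the HuggingFace models and the hyperparameters listed above; the use of the library's sequence-classification head and the plain training loop are assumptions about implementation details:

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def fine_tune(texts, labels, model_name="bert-base-uncased", num_labels=2):
    tok = AutoTokenizer.from_pretrained(model_name)
    enc = tok(list(texts), max_length=50, padding="max_length",
              truncation=True, return_tensors="pt")
    data = TensorDataset(enc["input_ids"], enc["attention_mask"],
                         torch.tensor(labels))
    loader = DataLoader(data, batch_size=32, shuffle=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels)
    optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(3):                                  # 3 epochs (Section 5.1)
        for input_ids, attention_mask, y in loader:
            optim.zero_grad()
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
            out.loss.backward()
            optim.step()
    return model

# The same sketch applies to roberta-base and xlnet-base-cased by changing model_name.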

We reported our results based on the classification metrics defined below:

$$\begin{aligned}
Precision&= \frac{TP}{TP+FP}&(1)\\
Recall&= \frac{TP}{TP+FN}&(2)\\
Accuracy&= \frac{TP+TN}{N}&(3)\\
F1\text {-}score&=\frac{2 \times Precision \times Recall}{Precision+Recall}&(4)
\end{aligned}$$

where TP is True Positives, FP is False Positives, FN is False Negatives, and N is the total number of records. Besides Accuracy in (3), we chose to also report the F1-macro which averages the F1-score in (4) over the classes: the macro-averaged F1 is better suited for showing algorithm effectiveness on smaller categories Altrabsheh et al. (2014); Kastrati et al. (2021), which is important as we are working with imbalanced datasets.
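For consistency, these metrics can be computed with scikit-learn; a minimal sketch:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report(y_true, y_pred):
    # Accuracy plus macro-averaged precision, recall and F1 (averaged over classes).
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return {"accuracy": acc, "precision_macro": prec,
            "recall_macro": rec, "f1_macro": f1}

# Example: report([1, 0, 1, 1], [1, 0, 0, 1]) gives accuracy 0.75 and F1-macro of about 0.73.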

5.2 Results and discussion

5.2.1 Sentiment analysis

The results for the sentiment analysis task are shown in Table 5. As shown in Table 5, the transformer-based models performed the best overall: RoBERTa is the top performing model at 95.5% accuracy and 84.7% F1-macro, while BERT and XLNet follow with an F1-macro of about 83%. Among the rest of the DL models, CNNs performed the best, with the CNN using pre-trained word embeddings reaching 92.1% accuracy and 82.4% F1-macro. For the TF-IDF models, accuracies were high, but F1-macro results were low, around 50% for all the models we utilized.

Table 5 Results for sentiment analysis - highest is highlighted in Bold

We also performed a comparison based on maxlen input values for BERT, RoBERTa, and XLNet: see Table 6 for the sentiment analysis task. Higher maxlen values mean using more words from each comment (or a larger part of the comment) as input to the model. As shown in Table 6, all transformer models took about 3-4 minutes per epoch for maxlen equal to 100 versus under 10 seconds for the word CNN. Among the three models, RoBERTa was slightly faster than BERT at under 3 minutes per epoch and XLNet was the slowest at almost 4 minutes per epoch. This means that each of the transformer models needed about 1.5-2 hours total runtime given 3 epochs and 10-fold stratified CV. We were not able to run transformer model experiments with maxlen larger than 100 due to these long runtimes. XLNet and RoBERTa did the best in these experiments (about 97% accuracy and 89% F1-macro for maxlen=100).

Table 6 Comparison of deep learning models given different maxlen values for sentiment analysis - runtime is seconds per epoch

Table 6 shows that the CNN model also gained from an increase in maxlen, while its execution time remained under 10 seconds per epoch. Given this observation, we also ran experiments with the CNN and maxlen higher than 100. The resulting plot is shown in Fig. 6a (using regular word embeddings). As the figure shows, there was no gain for this model on the sentiment analysis task from increasing maxlen beyond 150, and its F1-macro stayed at 80%.

Overall, for the sentiment analysis task, we see the superiority of the transformer-based models, BERT, RoBERTa, and XLNet, especially how well these models perform with imbalanced data. We also see that using more words as input to the models increases accuracy at the expense of further increasing their execution time. Another avenue we leave for future research is to use a model such as DistilBERT Sanh et al. (2019), a distilled and smaller version of BERT that has been shown to be much faster.

Finally, we examined records on which two of the top performing models, BERT and RoBERTa, disagreed on their predictions (given the same train/test split of the records). In the following, we provided a couple of reviews as examples (we edited the reviews for length and content, but made sure to preserve the spirit of each review). The following was correctly classified as positive by BERT but negative by RoBERTa: “I went in not knowing much more than how to write a simple program, and I got a good job [...] I had to spend a lot of time outside the classes to learn [...] curriculum is alright, a bit scattered [...] overall they’re competent and do a good job [...]”. Even though this review had a high star rating, as a whole it contained somewhat mixed opinions and negative wording. As a second example, RoBERTa labeled this review correctly as positive, while BERT labeled it as negative: “I’d like to respond to the one review trashing X. It’s totally wrong. X is passionate and smart, but if he sees you doing something you shouldn’t, he is not afraid to call you out. [...] I can honestly affirm that <course> was life changing [...]”. The review had an argumentative tone trying to defend an instructor, and then it was very positive of the instructor and the course. Both examples show the complexity of capturing sentiment at the paragraph (or essay) level, as researchers have previously noted Ren et al. (2022). Future research could focus on the sentence level as a way of improving the sentiment classification of such reviews.

5.2.2 Topic-based classification

The results for the topic-based classification task are shown in Table 7. For this task, a linear SVM was the top performing model at 79.8% accuracy and 80.6% F1-macro. The transformer-based models, BERT, RoBERTa, and XLNet, followed closely, with accuracy and F1-macro in the low to high 70s. The rest of the DL models were in the low to mid 60s for accuracy. The highest performing among them was the CNN model using the pre-trained word embeddings, with 65.9% accuracy and 65% F1-macro.

Table 7 Results for topic-based classification - highest value is highlighted in Bold

Example confusion matrices for four of the models on the topic classification task are shown in Fig. 5. The models shown are the Linear SVM (Fig. 5a), the CNN model with regular word embeddings (Fig. 5b), the CNN model with pre-trained word embeddings (Fig. 5c), and BERT (Fig. 5d). All (hyper)parameters are the same as those used for the results in Table 7. For reference, the topics and their distribution in the data are shown in Table 4.

Fig. 5 Confusion matrices for four models on topic-based classification

From Fig. 5a, it appears that the SVM did very well (91%) on the largest topic (“Web Development”, about 51% of the records). BERT, on the other hand, did the best on the smallest topic (“Non-Programming”) and relatively well (mid 70s) on the rest of the topics (see Fig. 5d). The CNN models did worse overall (see Fig. 5b and c), except that the CNN with pre-trained word embeddings matched the performance of the SVM on the large topic (“Web Development”) (see Fig. 5a and c).

Table 8 Comparison of Deep Learning models given different maxlen values for Topic Classification - Runtime is Seconds per Epoch
Fig. 6 Effect of maxlen on Accuracy and F1-macro for the CNN Model (regular word embeddings) for both tasks

As we did for the sentiment analysis task (see Section 5.2.1), we also experimented with higher maxlen values for the topic-based classification task. Results are shown in Table 8. For maxlen set to 100, BERT surpassed the Linear SVM result: BERT’s F1-macro was 82.5% versus the 80.6% of the SVM (shown in Table 7). However, BERT achieved this at around 2 minutes per epoch, which made the full 5-fold stratified CV run with 3 epochs take about half an hour.

As the two CNN variants were the best of the regular DL models, we also experimented with increasing maxlen for the CNN. Figure 6b shows the effect of different maxlen values on the classification performance of the CNN model with regular word embeddings for the topic-based classification task. As can be seen in Fig. 6, as maxlen increased (meaning that more of the text from each comment is used as input to the model), the accuracy and F1-macro tended to increase overall, but the increase was much more pronounced for the topic classification task (Fig. 6b) than for the sentiment analysis task (Fig. 6a). For maxlen set at 300, the CNN model performed as well as the Linear SVM in classifying the course topics (see Table 7). It is noteworthy that we did not observe a similar effect on the accuracy of the linear SVM when limiting the number of tokens or features used as input, and that the SVM has a smaller vocabulary than the CNN model. We conducted similar experiments for the LSTM and did not see an increase in accuracy. As a note, in these results, the CV experiments led to a range of ±0.05 to ±0.07 deviation from the numbers shown in the plots.

Fig. 7 Accuracy and Loss per epoch for the CNN Model (pre-trained word embeddings) on Topic-Based Classification

Finally, we also investigated the effect of the number of epochs for the CNN model. Figure 7 shows an example of a typical run using the pre-trained word-based CNN on the topic classification task. As can be seen from both the accuracy and loss figures, the best values for the number of epochs were 4-6, after which the network started overfitting on the training data.

Overall, we see that the topic classification task is more challenging for the models than the sentiment analysis task. This was expected, as the binary sentiment analysis task has been shown to be relatively more straightforward for DL models in recent years. Our empirical results also indicate that a larger part of each review (controlled by the maxlen parameter) should be fed as input to the DL models to improve performance. After manually exploring several reviews chosen at random, we observed that the topic-related wording sometimes did not appear until later in the comment. This supports the finding that using a larger part of the course review results in better model performance. Finally, some of the reviews may lack topic-related terms or language, or they may contain wording shared by more than one topic, and are therefore difficult to classify even for a human. For example, consider the following snippets of a ‘Web Development’ course review: “This class is a good jump start into a technical career! The class has a limited amount of time and it’s really hard to go over as much content as there is, but it’s all necessary to get down the fundamentals in that timeframe. [...] is willing to help people with coding problems after it is over [...]”. This review could easily be classified into ‘Programming’ instead of ‘Web Development’ (note that none of the parts we omitted from this example included any terms or language specific to web development).

A limitation of this work is the way we selected and assigned the topics: for example, the ‘Non-Programming’ topic included reviews for courses in digital marketing or UX design, which made this topic less focused than the rest. At the same time, the ‘Programming’ topic included reviews for courses in iOS, Android, or full-stack development, which may also use very different terminology. Finally, as we just discussed, the terms or expressions used in a review for a ‘Programming’ course could very well apply to a review for a ‘Web Development’ course. As our dataset is available to others, future research could look into topics that are more fine-grained or assigned differently.

6 Conclusions

In this study, we described how we collected and pre-processed more than ten thousand course review comments publicly available online. We presented extensive experimentation with several ML techniques to extract sentiment from the review text as well as to detect the topic of the course for which each review was written. The techniques we experimented with include a traditional bag-of-words representation of the text as well as word embeddings and character embeddings. Our classification models range from traditional machine learning, such as Naïve Bayes and SVMs, to current DL techniques based on CNNs and LSTMs. Finally, we fill a gap in the current research by exploring state-of-the-art transformer-based models (BERT Devlin et al. (2019), RoBERTa Liu et al. (2019), and XLNet Yang et al. (2019)), which have not yet been used extensively in the course review analysis field.

Our extensive experimentation with these algorithms shows how the different models behave on the two tasks. For the sentiment analysis task, the state-of-the-art transformer-based NLP models perform the best. For the topic classification task, a traditional model, the SVM, performs the best, though the DL models become top-performing when we increase the fraction of the course review that is fed as input to the model (using a hyperparameter called maxlen). At the same time, we provide a complete picture by showing that the state-of-the-art models require much longer execution times to achieve their results.

Sentiment analysis and topic classification can be used by educators and administrators as part of their assessment process in order to continuously improve instruction delivery and address issues. Our empirical results, exploration, and discussion can serve to guide others in the analysis of their own course feedback data, and our data and models could be used in such analyses. A future research goal is to further explore our data for aspect-based sentiment analysis. We also plan to explore other features in the data, such as the helpfulness of the reviews, and the use of additional pre-trained models such as EduBERT Clavié and Gal (2019). Finally, we would like to explore the applicability of our pre-trained DL models to other student feedback data.