
Deep neural networks for the automatic understanding of the semantic content of online course reviews

Education and Information Technologies


The rise of massive open online courses (MOOCs) brings rich opportunities for understanding learners' experiences by analyzing learner-generated content such as course reviews. Traditionally, such unstructured textual data is analyzed qualitatively via manual coding, which cannot offer a timely understanding of learners' experiences. To address this problem, this study explores the ability of deep neural networks (DNNs) to classify the semantic content of course review data automatically. Based on 102,184 reviews from 401 MOOCs collected from Class Central, the present study developed DNN-empowered models to automatically distinguish a group of semantic categories. Results showed that DNNs, especially recurrent convolutional neural networks (RCNNs), achieve acceptable performance in capturing and learning features of course review texts for understanding their semantic meanings. By dramatically lightening the coding workload and enhancing analysis efficiency, the RCNN classifier proposed in this study allows timely feedback about learners' experiences, based on which course providers and designers can develop suitable interventions to improve MOOC instructional design.



Data availability

The data are available upon reasonable request from the corresponding author.







Author information

Authors and Affiliations


Corresponding author

Correspondence to Di Zou.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: The technical details about the DNNs used for training and testing classifiers for review topic classification



FastText is a fast algorithm (Joulin et al., 2017) for learning text representations for classification. It builds on word2vec, which adopts a skip-gram model to identify relationships between words through word-usage pattern analysis. FastText is also regarded as a sub-word model capable of integrating sub-word information into embedding learning (Schmitt et al., 2018). Its model architecture relies on hierarchical softmax and feature hashing. Briefly, FastText “takes the sentence represented as n-gram features and embeds these features using an embedding layer”, and “the embeddings of the n-gram features are then averaged to form the final representation of the sentence and are projected onto the output layer” (Paul et al., 2017, p. 1589). Because training a linear classifier is computationally expensive when the number of categories is large, Joulin et al. (2017) used a hierarchical softmax based on the Huffman coding tree to reduce the computational cost. The hierarchical softmax is “advantageous at test time when searching for the most likely class” (Joulin et al., 2017, p. 428). Each node is associated with the probability of the path from the root to that node, so the probability of a node is always lower than that of its parent. The exploration of “the tree with a depth first search and tracking the maximum probability among the leaves allows [users] to discard any branch associated with a small probability” (Joulin et al., 2017, p. 428).
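The hierarchical softmax described above can be illustrated with a minimal sketch. It is not the FastText implementation: the class names, class frequencies, and node scores are made up for illustration, and in a trained model each node score would come from the text representation rather than a random number generator.

```python
import heapq
import itertools
import math
import random


class Node:
    """Leaf if `label` is set; otherwise an inner node with a branch score."""
    def __init__(self, label=None, left=None, right=None, score=0.0):
        self.label, self.left, self.right, self.score = label, left, right, score


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_huffman(class_counts, rng):
    """Build a Huffman tree so that frequent classes sit close to the root."""
    tie = itertools.count()  # tie-breaker so heapq never compares Node objects
    heap = [(n, next(tie), Node(label=c)) for c, n in class_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Hypothetical score: a trained model would compute it from the input.
        parent = Node(left=left, right=right, score=rng.uniform(-2.0, 2.0))
        heapq.heappush(heap, (n1 + n2, next(tie), parent))
    return heap[0][2]


def leaf_probabilities(node, path_prob=1.0):
    """Leaf probability = product of the branch probabilities on its path."""
    if node.label is not None:
        return {node.label: path_prob}
    p_left = sigmoid(node.score)
    probs = leaf_probabilities(node.left, path_prob * p_left)
    probs.update(leaf_probabilities(node.right, path_prob * (1.0 - p_left)))
    return probs


def most_likely_leaf(node, path_prob=1.0, best=(None, 0.0)):
    """Depth-first search that discards branches unable to beat the best leaf."""
    if path_prob <= best[1]:
        return best  # prune: probabilities only shrink along a path
    if node.label is not None:
        return (node.label, path_prob)
    p_left = sigmoid(node.score)
    best = most_likely_leaf(node.left, path_prob * p_left, best)
    return most_likely_leaf(node.right, path_prob * (1.0 - p_left), best)


# Made-up class frequencies, for illustration only.
counts = {"class_a": 50, "class_b": 30, "class_c": 12, "class_d": 8}
root = build_huffman(counts, random.Random(0))
probs = leaf_probabilities(root)
best_label, best_prob = most_likely_leaf(root)
```

Because every step multiplies the path probability by a value below one, a branch whose probability is already below the best leaf found so far can be skipped without changing the result.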

FastText uses a shallow neural algorithm for text classification, much like the continuous bag-of-words approach. However, rather than predicting a word based on its neighbors, FastText predicts the target label based on the sample’s words (Rosenthal et al., 2021). FastText adopts “a bag of \(n\)-grams as additional features to capture some partial information” (Minaee et al., 2020, p. 4) about the local word order and then maps them into a low-dimensional vector space. In this way, the features can be shared across different categories regardless of their lexical differences. Thus, it can “not only learn similar embeddings for word forms sharing a common stem but also generate embeddings for unseen words in the test set by combining the learned character n-gram embeddings” (Schmitt et al., 2018, p. 1111). Because of this feature, FastText is notably efficient in practical usage and achieves performance comparable to approaches that use word order explicitly (Rospocher, 2022), which are computationally more expensive.
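A minimal sketch of this pipeline (hashed n-gram features, averaged embeddings, linear output layer) is given below. It is illustrative only: the table size, dimensions, and random weights are assumptions, `zlib.crc32` stands in for FastText's own hash function, a plain softmax replaces the hierarchical one, and only word-level n-grams are used.

```python
import math
import random
import zlib

NUM_BUCKETS = 1000  # size of the hashed feature table (assumed)
DIM = 8             # embedding dimension (assumed)
NUM_CLASSES = 3     # number of target labels (assumed)

rng = random.Random(0)
# Untrained, randomly initialised parameters; training would update both.
embeddings = [[rng.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]
output_weights = [[rng.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_CLASSES)]


def ngram_features(text, n=2):
    """Unigrams plus word n-grams, each mapped to a bucket by a stable hash."""
    words = text.lower().split()
    grams = words + [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams]


def sentence_vector(text):
    """Average the embeddings of all hashed n-gram features."""
    ids = ngram_features(text)
    if not ids:
        return [0.0] * DIM
    return [sum(embeddings[i][d] for i in ids) / len(ids) for d in range(DIM)]


def predict_proba(text):
    """Project the averaged vector onto the output layer and apply a softmax."""
    h = sentence_vector(text)
    logits = [sum(w * x for w, x in zip(row, h)) for row in output_weights]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


features = ngram_features("great course content")  # 3 unigrams + 2 bigrams
proba = predict_proba("great course content")
```

Hashing keeps the embedding table at a fixed size however many distinct n-grams occur, at the cost of occasional collisions between unrelated n-grams.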

FastText also updates word vectors via back-propagation during model training, which allows the model to fine-tune word representations for the target task (Bojanowski et al., 2017). Because of the efficient implementation and an optimized learning rate schedule, FastText processes “input text at a speed of several orders of magnitude [faster than] that of ConvNets” (Zhang & LeCun, 2017, p. 6). Despite its simple architecture, FastText has been considered effective and efficient in diverse text classification tasks (Qiao et al., 2018) and is usually “at par with deep learning classifiers in terms of accuracy, and much faster for training and evaluation” (Jha & Mamidi, 2017, p. 12). It is especially effective for tasks that do not allow extensive hyperparameter tuning (Rosenthal et al., 2021). More algorithmic details of FastText are described by Joulin et al. (2017).


CNNs apply a mathematical operation called convolution, a specialized kind of linear operation, over a feature matrix (Goodfellow et al., 2016). CNNs were originally proposed for computer vision and have gradually been considered effective for addressing issues concerning NLP, such as semantic parsing, sentence modeling, and search query retrieval. TextCNN is adopted particularly for the extraction of sentence features.

In a TextCNN, a one-dimensional convolution operation produces new features \(c_i\) by applying a filter \(w\in {\mathbb{R}}^{h\times k}\) to a window of \(h\) words \({x}_{i:i+h-1}\), each represented by a \(k\)-dimensional vector, where \(c_i=f(w\otimes x_{i:i+h-1}+b)\), in which \(b\in {\mathbb{R}}\) denotes the bias term, and \(f(\cdot )\) indicates a non-linear function. With padding, each filter produces a feature vector \({\varvec{c}}={[{c}_{1},{c}_{2},...,{c}_{m}]}^{\mathrm{T}}\). Then a max-over-time pooling operation is applied over each feature map, capturing the maximum value \(\widehat{c}=\max({\varvec{c}})\) as the feature corresponding to the filter. The CNN model can capture various features using multiple filters, and the features are passed to a fully connected layer to output logits, \(a\in {\mathbb{R}},\) for label prediction and computing loss.
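The convolution and max-over-time pooling steps can be sketched as follows. This is a minimal illustration, assuming \(f=\tanh\), toy two-dimensional word vectors, and no padding, so a sentence of \(m\) words yields \(m-h+1\) features per filter rather than \(m\).

```python
import math


def conv1d_feature_map(x, w, b):
    """Slide a filter over consecutive windows of word vectors.

    x: list of word vectors, each of length k
    w: filter given as h rows, each of length k
    Returns one feature tanh(sum(w * window) + b) per full window of h words.
    """
    h = len(w)
    feature_map = []
    for i in range(len(x) - h + 1):
        window = x[i:i + h]
        s = sum(w_row[j] * x_row[j]
                for w_row, x_row in zip(w, window)
                for j in range(len(w_row)))
        feature_map.append(math.tanh(s + b))
    return feature_map


def max_over_time(feature_map):
    """Keep only the strongest response of the filter over the sentence."""
    return max(feature_map)


# Toy sentence of 4 words with k = 2, and one filter spanning h = 2 words.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
c = conv1d_feature_map(x, w, b=0.0)
c_hat = max_over_time(c)

# With several filters, the pooled values form the vector that is passed
# to the fully connected layer.
filters = [w, [[0.0, 1.0], [1.0, 0.0]]]
pooled = [max_over_time(conv1d_feature_map(x, flt, 0.0)) for flt in filters]
```

Max-over-time pooling makes the output size depend only on the number of filters, so sentences of different lengths map to vectors of the same dimension.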


A Recurrent Neural Network (RNN) (Liu et al., 2016) handles a variable-length input sequence through a recurrent hidden state whose activation at each time step depends on that of the previous step. In the RNN model, a Bi-LSTM is adopted following the same settings as previous studies (Zhou et al., 2015). In each LSTM cell, the transition functions are defined as in Eq. (8), where \(\sigma\) is the sigmoid function, \(\tanh\) is the hyperbolic tangent function, and \(\odot\) represents element-wise multiplication. More concretely, \({\text{f}}_{t}\) controls to what extent the old information is forgotten, \({\text{i}}_{t}\) controls how much new information is added, and \({\text{o}}_{t}\) controls the output of the current cell. With the hidden dimension set to \(d\), so that \({{\varvec{h}}}_{t}\in {\mathbb{R}}^{d}\) at time \(t\), the whole semantic feature \(C\in {\mathbb{R}}^{d\times m}\) can be obtained. As in TextCNN, \(C\) is mapped to logits \(a\in {\mathbb{R}}\) for label prediction and computing loss after passing through fully connected layers.

$$\left\{\begin{array}{l}
{\text{i}}_{t}=\sigma \left({\text{W}}_{i}\cdot \left[{\text{h}}_{t-1},{\text{x}}_{t}\right]+{\text{b}}_{i}\right)\\
{\text{f}}_{t}=\sigma \left({\text{W}}_{f}\cdot \left[{\text{h}}_{t-1},{\text{x}}_{t}\right]+{\text{b}}_{f}\right)\\
{\text{q}}_{t}=\mathrm{tanh}\left({\text{W}}_{q}\cdot \left[{\text{h}}_{t-1},{\text{x}}_{t}\right]+{\text{b}}_{q}\right)\\
{\text{o}}_{t}=\sigma \left({\text{W}}_{o}\cdot \left[{\text{h}}_{t-1},{\text{x}}_{t}\right]+{\text{b}}_{o}\right)\\
{\text{c}}_{t}={\text{f}}_{t}\odot {\text{c}}_{t-1}+{\text{i}}_{t}\odot {\text{q}}_{t}\\
{\text{h}}_{t}={\text{o}}_{t}\odot \mathrm{tanh}\left({\text{c}}_{t}\right)
\end{array}\right.\tag{8}$$
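A single LSTM transition can be written out directly from the gate equations above. The sketch below is illustrative: the parameter layout (a dictionary from gate name to a weight matrix and bias) and the all-zero weights in the usage example are assumptions chosen so that the result is easy to check by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W @ v + b for a matrix W stored as a list of rows."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM transition; params maps gate name -> (W, b)."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    pre = {name: affine(W, z, b) for name, (W, b) in params.items()}
    i_t = [sigmoid(a) for a in pre["i"]]    # input gate
    f_t = [sigmoid(a) for a in pre["f"]]    # forget gate
    q_t = [math.tanh(a) for a in pre["q"]]  # candidate cell content
    o_t = [sigmoid(a) for a in pre["o"]]    # output gate
    c_t = [f * c + i * q for f, c, i, q in zip(f_t, c_prev, i_t, q_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hidden size d = 2, input size k = 3, all weights zero: every sigmoid gate
# is 0.5 and the candidate is 0, so the cell state is simply halved.
d, k = 2, 3
zero = ([[0.0] * (d + k) for _ in range(d)], [0.0] * d)
params = {name: zero for name in ("i", "f", "q", "o")}
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [2.0, -2.0], params)
```

A Bi-LSTM runs this step once over the sequence from left to right and once from right to left with separate parameters, then combines the two hidden states at each position.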


Convolutional Recurrent Neural Network (CRNN) is a hybrid model that combines CNNs and RNNs. In CRNNs, CNNs extract features, whereas RNNs are utilized as a temporal summarizer. The adoption of RNNs for feature aggregation allows “the networks to take the global structure into account while the remaining convolutional layers extract local features” (Choi et al., 2017, p. 2392). CRNN was initially developed by Tang et al. (2015) for text classification and is extensively used in other domains like image classification and music transcription.

Briefly, following Peng et al. (2020a, b), the CRNN architecture involves convolutional, recurrent, and transcription layers. The convolutional layers automatically extract a feature sequence from each input. The recurrent layers predict a label distribution for each frame of the feature sequence produced by the convolutional layers. The transcription layer transforms the per-frame predictions into label sequences. Because a CRNN combines different network structures, a single loss function is used to train them jointly. Specifically, the convolutional component is built from the convolutional and max-pooling layers of a standard CNN model and extracts serialized feature representations from the input data. Data is normalized before being fed into the network. A sequence of feature vectors is then extracted from the feature maps column by column. A deep Bi-RNN is built on top of the convolutional layers to predict the label distribution \({y}_{t}\) for each frame \({x}_{t}\) in the feature sequence \(x={x}_{1},...,{x}_{T}\).


TextRCNN, proposed by Lai et al. (2015), is a deep neural model for capturing text semantics. The input is a document \(D\) composed of a sequence of words \({w}_{1},{w}_{2},...,{w}_{n}\). The output consists of category components. Fundamentally, the Recurrent Convolutional Neural Network (RCNN) addresses the bias of RNNs whereby later words dominate earlier ones (Mahmood et al., 2020). RCNN takes advantage of the recurrent structure to capture contextual information and learns documents’ feature representations with the help of CNNs, using a max-pooling layer to select important words (Yan et al., 2018). The units of the recurrent structure, CNNs, and max-pooling layer depend on their neighboring units, so their activation in RCNN changes over time (Salehinejad et al., 2017). Such changes increase model depth while keeping the number of parameters the same via weight sharing between layers. RCNN was initially adopted in text classification, paraphrase detection, and semantic role labeling; subsequently, it has been utilized for finding the semantic relatedness of phrases and sentences in academic texts.

RCNN creates word representations composed of the left context acquired from a forward RNN, the word embedding, and the right context acquired from a backward RNN. Word representations are produced by a bidirectional RNN and a max-pooling layer. Equations (9) and (10) give the left and right contexts of word \({w}_{i}\), denoted \({c}_{l}({w}_{i})\) and \({c}_{r}({w}_{i})\). \(e({w}_{i-1})\) is the word embedding of the previous word. The matrix \({W}^{(l)}\) transforms one hidden layer into the next. The matrix \({W}^{(sl)}\) combines the semantics of the current word with the left context of the next word. \(f\) represents a non-linear activation function. Equation (11) gives the representation \({x}_{i}\) of word \({w}_{i}\) as the concatenation of the left context vector \({c}_{l}({w}_{i})\), the word embedding \(e({w}_{i})\), and the right context vector \({c}_{r}({w}_{i})\). Each word representation \({x}_{i}\) is passed through a standard layer that applies a linear transformation followed by the \(\mathrm{tanh}\) function, yielding \({y}_{i}^{(2)}\), a semantic vector used for finding the most valuable semantics in the text. Subsequently, a max-pooling layer is applied using Eq. (12) to extract the most salient features across the word representations; the max function takes the element-wise maximum over the vectors \({y}_{i}^{(2)}\). Finally, the output layer is calculated by Eq. (13), and its result is passed through the softmax function in Eq. (14), which transforms the output into probabilities so that the text is classified into the most likely category. The training objective is to maximize the log-likelihood of the correct category. The network's weights are initialized from a uniform distribution.
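The equations referenced in this paragraph are not reproduced in this extract. For orientation, the corresponding definitions as given by Lai et al. (2015) are listed below; the notation follows that paper, the numbering follows the references in the paragraph above, and both may differ in detail from the article's own typeset equations. Here \({W}^{(r)}\) and \({W}^{(sr)}\) are the right-context counterparts of \({W}^{(l)}\) and \({W}^{(sl)}\), and \({W}^{(2)}, {b}^{(2)}, {W}^{(4)}, {b}^{(4)}\) are the parameters of the hidden and output layers.

$$\begin{aligned}
{c}_{l}({w}_{i}) &= f\left({W}^{(l)}{c}_{l}({w}_{i-1})+{W}^{(sl)}e({w}_{i-1})\right) && (9)\\
{c}_{r}({w}_{i}) &= f\left({W}^{(r)}{c}_{r}({w}_{i+1})+{W}^{(sr)}e({w}_{i+1})\right) && (10)\\
{x}_{i} &= \left[{c}_{l}({w}_{i});\,e({w}_{i});\,{c}_{r}({w}_{i})\right] && (11)\\
{y}_{i}^{(2)} &= \mathrm{tanh}\left({W}^{(2)}{x}_{i}+{b}^{(2)}\right)\\
{y}^{(3)} &= \max_{i=1}^{n}{y}_{i}^{(2)} && (12)\\
{y}^{(4)} &= {W}^{(4)}{y}^{(3)}+{b}^{(4)} && (13)\\
{p}_{i} &= \frac{\mathrm{exp}\left({y}_{i}^{(4)}\right)}{\sum_{k}\mathrm{exp}\left({y}_{k}^{(4)}\right)} && (14)
\end{aligned}$$

In the pooling step the maximum is taken element-wise, so the \(k\)-th component of \({y}^{(3)}\) is the largest \(k\)-th component among all \({y}_{i}^{(2)}\).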



The Hierarchical Attention Network (HAN) model was developed by Yang et al. (2016) for text classification. In it, a sentence’s representation is constructed by processing the sequence of its constituent words with a bidirectional GRU. The sentence representations then go through sentence-level processing via another bidirectional GRU to construct the document representation. A HAN model contains a Word Encoder, Word Attention, a Sentence Encoder, and Sentence Attention. First, sentence-level discourse segmentation is adopted to divide a sentence into clauses. Then, a Bi-LSTM is used to model all clauses in the sentence, and a word-level attention mechanism captures essential words in each clause. Finally, another Bi-LSTM is used to model the attentive representation of each clause, and a clause-level attention mechanism captures essential clauses in a sentence.

Algorithmically, HAN exploits the structure of the documents by encoding the text in two consecutive steps. First, a Bi-GRU (\(EN{C}_{w}\)) followed by a self-attention mechanism (\(EN{C}_{s}\)) turns the word embeddings (\({w}_{it}\)) of each section \({s}_{i}\) with \({T}_{i}\) words into a section embedding \({c}_{i}\). See Eqs. (15–17) for details, in which \({u}^{(s)}\) is a trainable vector. Then, \(EN{C}_{d}\), another Bi-GRU with self-attention, converts the section embeddings (\(S\) in total, as many as the sections) to the final document representation \(d\). See Eqs. (18–20) for details, in which \({u}^{(d)}\) is a trainable vector. The final decoder \(DE{C}_{d}\) of HAN is the same as in Bi-GRU-ATT.

$${a}_{it}^{(s)}=\frac{\mathrm{exp}({v}_{it}^{\rm T}{u}^{(s)})}{\sum_{j}\mathrm{exp}({v}_{ij}^{\rm T}{u}^{(s)})}$$
$${a}_{i}^{(d)}=\frac{\mathrm{exp}({v}_{i}^{\rm T}{u}^{(d)})}{\sum_{j}\mathrm{exp}({v}_{j}^{\rm T}{u}^{(d)})}$$
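The two attention steps can be sketched as a softmax over dot products with a trainable vector, followed by a weighted sum. The sketch below is illustrative only: the hidden vectors are made-up stand-ins for Bi-GRU outputs, the vectors `u_s` and `u_d` are fixed by hand rather than trained, and the second Bi-GRU between the two levels is omitted.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(vectors, u):
    """Weights a_j = exp(v_j . u) / sum_k exp(v_k . u), then a weighted sum."""
    scores = [dot(v, u) for v in vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(a * v[d] for a, v in zip(weights, vectors))
              for d in range(len(u))]
    return weights, pooled


# Hypothetical word-level hidden vectors for two sections of a document.
sections = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 0.0], [1.0, 1.0]],
]
u_s = [0.0, 0.0]   # word-level context vector (a zero vector gives uniform weights)
u_d = [10.0, 0.0]  # section-level context vector

# Step 1: pool the words of each section into a section embedding c_i.
section_embeddings = [attention_pool(words, u_s)[1] for words in sections]
# Step 2: pool the section embeddings into the document representation d.
doc_weights, doc_vector = attention_pool(section_embeddings, u_d)
```

With a zero context vector every word receives the same weight, so the section embedding reduces to a plain average; a trained context vector shifts weight toward the words or sections that matter for the task.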


Lin et al. (2017a, b) developed a sentence embedding model composed of a Bi-LSTM and a self-attention mechanism. Here, the self-attention mechanism is used to “provide a set of summation weight vectors for the LSTM hidden states”, and “the summation weight vectors are dotted with the LSTM hidden states, [and] the resulting weighted LSTM hidden states are considered as an embedding for the sentence” (Lin et al., 2017a, b, p. 2). Algorithmically, TextSANN adopts an attention mechanism over Bi-LSTM’s hidden states to generate a representation \(u\) of an input sentence. In the attention mechanism, \(\{{h}_{1},...,{h}_{T}\}\) are Bi-LSTM’s output hidden vectors, which are fed to an affine transformation \((W,{b}_{w})\) to output a set of keys \(({\overline{h} }_{1},...,{\overline{h} }_{T})\). Here, “the \(\left\{{\alpha }_{i}\right\}\) represent[s] the score of similarity between the keys and a learned context query vector \({u}_{w}\), and these weights are used to produce the final representation \(u\), which [is] a weighted linear combination of the hidden vectors” (Conneau et al., 2017, p. 673).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chen, X., Zou, D., Cheng, G. et al. Deep neural networks for the automatic understanding of the semantic content of online course reviews. Educ Inf Technol (2023).
