1 Introduction

Massive open online courses (MOOCs), which deliver high-quality education with unprecedented cost-effectiveness and worldwide accessibility [1], have led researchers to explore diverse topics such as MOOC classification, learning engagement, and concept recommendation. For example, drawing on 102,184 reviews across 401 MOOCs, Chen et al. [2] devised DNN-powered models to automatically differentiate a set of semantic groupings. Wei et al. [3] examined the correlations between motivation, perceived support, engagement, and self-regulated learning strategies concerning learners’ perceived outcomes in MOOCs using an online survey of 546 participants. Gong et al. [4] focused on concept recommendation driven by reinforcement learning and heterogeneous information networks, leveraging the interactions between users and knowledge concepts, and among users, courses, videos, and concepts. Gong et al. [5] proposed an attention-driven, heterogeneous graph convolutional deep knowledge recommender designed to suggest knowledge concepts within MOOCs; the recommender harnessed content and contextual information to learn entity representations through graph convolution networks.

The proliferation of MOOCs has generated concerns about and prompted extensive research on courses’ quality and effectiveness. Currently, MOOC websites provide only coarse overall ratings, making it difficult to differentiate between courses, particularly when several courses display the same overall rating. There is therefore a pressing need for detailed insights into the performance of MOOCs against different evaluation criteria [6]. As learners have diverse needs, some may prioritize assessment aspects, while others may value interaction. By displaying course performance across different dimensions, learners can identify courses that excel in the criteria most relevant to their preferences.

Traditional top-down methods for MOOC course evaluation include surveys, expert interviews, and literature reviews [7,8,9]. Survey-based methods are frequently used to explore learner engagement, satisfaction, and the intention to continue with courses. For instance, Dai et al. [9] revealed that learners’ attitudes significantly impacted their intention to continue, and that MOOC instructors should be cautious in their course promotions to avoid overemphasizing benefits. Similarly, based on 622 structured questionnaires from undergraduate students in Malaysia, Albelbisi et al. [8] revealed that (1) system quality positively influenced satisfaction, (2) satisfaction and service quality positively affected self-regulated learning, and (3) system quality affected self-regulated learning through satisfaction. Interviews, although involving fewer participants, offer the advantage of flexible and in-depth questioning that yields deeper insights into learners’ motivations, course completion, and satisfaction levels. For instance, Zhu et al. [7] carried out semi-structured interviews with 15 online learners, revealing that learners’ contentment was influenced by course design factors (such as well-structured organization) and instructional methodologies (such as instructor presence). However, surveys and interviews suffer from delayed feedback, time and cost constraints, expert-centricity, and a lack of learner perspectives. To address these limitations, adopting a scientific approach for systematically and objectively assessing the quality of online courses becomes crucial. Such an approach is pivotal in diagnosing and improving online course quality.

Due to the progress in big data and text mining methodologies, scholars have shifted their focus towards utilizing online course review data to obtain valuable insights for enhancing course quality [10]. This approach facilitates a more comprehensive understanding of learners’ emotions, actions, course acceptance, platform comparisons, and prevailing trends, contributing to a better grasp of the crucial elements influencing MOOC success [11]. However, existing methods for determining online course quality evaluation indicators and/or their weights often rely on group decision-making [1, 12, 13], resulting in poor adaptability and applicability for fine-grained evaluation of different types of courses. Expert simulation evaluations and the use of pre-determined indicators may not accurately reflect learners’ experiences and requirements, resulting in a lack of learner-centeredness in course evaluation. Thus, there is a need to leverage crowdsourced data from MOOC platforms, artificial intelligence (AI), and text analysis techniques to empower researchers to effectively tap into the collective wisdom and expertise of learners to enhance course design and learner contentment [14].

To address deficiencies in prior studies, this study focuses on learners’ learning experiences and needs, breaking away from the reliance on experiential judgment and excessive human subjectivity in traditional online course quality research. The objective is to leverage unstructured data from student feedback texts, utilizing text mining and hierarchical structure modeling to develop an intelligent evaluation model and implementation plan for online course quality.

Accordingly, the present study formulates three research questions (RQs):

  • RQ1: What factors will be included in the multi-criteria decision-making framework? What are the similarities and differences between the selection of knowledge- and skill-seeking courses?

  • RQ2: What are the most influential factors in course selection? What are the similarities and differences between the selection of knowledge- and skill-seeking courses?

  • RQ3: What are the ratings of courses and how do they rank according to sub-criteria and criteria?

This research endeavors to answer these questions by presenting a novel crowdsourcing technique that utilizes text mining and the analytic hierarchy process (AHP) to automatically evaluate MOOCs. The evaluation is based on the aggregation of learners’ reviews collected from 169 MOOCs on Class Central. Considering RQ1, this study integrates topics (sub-criteria) identified through topic modeling and interpreted under the framework of transactional distance and technology acceptance theories to form the hierarchical structure of MOOC evaluation criteria. Regarding RQ2, this study leverages the probability distribution of topics (sub-criteria) identified through topic modeling to weight the relative importance of the criteria. Subsequently, addressing RQ3, based on the established hierarchical structure and the respective criterion weights, this study employs AHP to rank the online courses and determine their overall relative advantage as well as their relative advantage within each criterion. By doing so, this study offers a fine-grained, text-mining-based analysis methodology for large-scale online course quality evaluation, providing the technical support needed for further investigations in this area.

2 MOOC evaluation based on review mining

Currently, online course quality evaluation research primarily focuses on the construction of evaluation indicators and models. These studies often rely on traditional methods such as literature review, questionnaire surveys, and expert scoring. However, these methods are time-consuming and costly, and they are subject to experiential evaluations and subjective interventions, resulting in poor adaptability of indicators and difficulty in conducting detailed evaluations for different course types [12, 15].

Consequently, developing theory-informed approaches to assessing the quality of online courses by mining student feedback has garnered considerable interest. The main achievements in this field involve using topic modeling to automatically identify latent evaluation topics from text data [16,17,18]. By combining topic modeling findings with theoretical examination, course evaluation indicators can be formulated. These indicators can then be hierarchically aggregated according to their affiliations to form a multi-level analytical structural model, which establishes a theoretical framework for course evaluation that authentically reflects learners’ demands and effectively guides evaluation practice.

Traditional research on course quality evaluation often relies on qualitative group decision-making and expert simulation evaluation methods, neglecting the role of learners as the primary stakeholders. Moreover, the use of pre-determined indicator factors often results in measurement items that fail to accurately reflect learners’ learning experiences and needs, making it difficult to achieve large-scale, normalized, and continuous course quality monitoring. However, learners’ perceived learning experiences are crucial references for online course design and quality improvement. Therefore, it is necessary to focus on learners’ experiences and needs and use text mining in course evaluations to provide important foundations for the evaluation of online teaching effectiveness from the learners’ perspective [1].

However, existing research on online course quality evaluation using text mining often relies on subjective weighting methods such as AHP [19] to obtain indicator weights [1]; such methods depend heavily on expert opinions and judgments and fail to reflect learners’ true experiences and needs. In contrast, using the estimated popularity of evaluation topics obtained via topic modeling as the weights of course evaluation indicators captures learners’ levels of interest, achieves automated customization of indicator weights, and enables automatic ranking of the courses to be evaluated [20].

3 Data preparation

3.1 Review dataset collection

The MOOC reviews were scraped from Class Central using a self-developed crawler and parsed into Excel files for further processing. After removing duplicates and MOOCs with fewer than 20 review comments, non-English reviews were identified and excluded using the Python package “langid”, a standalone language identification tool. TextBlob, a Python library, was then used to automatically check and correct spelling; for instance, “I havv goood speling” was corrected to “I have good spelling”. Two types of courses were considered in this study. The first relates to Art, Design, and Humanities, which is generally a “knowledge-seeking” domain. The second relates to Computer Science, Engineering, and Programming, which is generally a “skill-seeking” domain. As a result, a total of 52,881 reviews were obtained.
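To illustrate, the language filtering and spelling-correction steps described above can be sketched in Python. This is a minimal sketch, assuming the scraped reviews sit in a pandas DataFrame with a column named “review” (the DataFrame layout and column name are illustrative assumptions, not details from the study); langid and TextBlob are the packages named in the text.

```python
import langid                  # standalone language identification
import pandas as pd
from textblob import TextBlob  # spelling checking and correction


def keep_english_and_correct(df: pd.DataFrame, text_col: str = "review") -> pd.DataFrame:
    """Drop non-English reviews, then spell-correct the remaining ones."""
    # langid.classify returns a (language_code, score) tuple
    is_english = df[text_col].apply(lambda t: langid.classify(t)[0] == "en")
    english = df[is_english].copy()
    # TextBlob(...).correct() returns a spelling-corrected copy of the text
    english[text_col] = english[text_col].apply(lambda t: str(TextBlob(t).correct()))
    return english


# The example given in the text:
print(TextBlob("I havv goood speling").correct())  # -> "I have good spelling"
```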

3.2 Helpful review identification

Among the 52,881 reviews, 4407 had a helpful vote value. For these 4407 reviews, this study followed O’Mahony and Smyth [21] in defining the top 75%, ranked by the number of helpful votes received, as helpful reviews, while the rest were treated as unhelpful. An examination of 50 randomly selected unhelpful reviews suggested that they mostly lacked valuable information pertaining to specific aspects of MOOC courses. Examples included: “great class”, “the class was amazing”, “omg so good”, “I loved this class”, “one of the best I have ever taken”, “strongly recommended”, and “I don’t know what to say”. As a result, 3305 helpful reviews (labeled “1”) and 1102 unhelpful reviews (labeled “0”) were randomly divided into training and testing datasets to train and test a classifier based on a Naive Bayes model with word-level TF-IDF features, which was then used to automatically predict labels (“1” or “0”) for the 48,474 reviews without a helpful vote value. The prediction results, combined with the previously labeled sample, comprised the D1 and D2 datasets for (1) Art, Design, and Humanities and (2) Computer Science, Engineering, and Programming courses, respectively. The numbers of courses and reviews for the D1 and D2 datasets are presented in Table 1. Specifically, the D1 dataset comprised 63 courses and 6940 helpful reviews, accounting for 37.28% and 13.44% of the total numbers of courses and helpful reviews included, respectively. The D2 dataset comprised 106 courses and 44,697 helpful reviews, accounting for 62.72% and 86.56% of the totals, respectively.
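A hedged sketch of this classification step is shown below, using scikit-learn’s MultinomialNB and TfidfVectorizer as one plausible realization of “a Naive Bayes model with word-level TF-IDF”; the exact implementation, split ratio, and hyperparameters used in the study are not reported, so those shown here are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def label_unvoted_reviews(labeled_texts, labels, unlabeled_texts):
    """Train on the vote-labeled reviews (1 = helpful, 0 = unhelpful) and
    predict labels for reviews that received no helpful votes."""
    # word-level TF-IDF features
    vectorizer = TfidfVectorizer(analyzer="word")
    X = vectorizer.fit_transform(labeled_texts)

    # hold out a test split to check the classifier (the ratio is an assumption)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.25, random_state=42, stratify=labels)

    clf = MultinomialNB().fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))

    # predict "1"/"0" for the reviews without a helpful vote value
    return clf.predict(vectorizer.transform(unlabeled_texts))
```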

Table 1 Number of courses and reviews for the D1 and D2 datasets

3.3 Data preprocessing

To preprocess the data for topic modeling, six sequential steps were executed: (1) tokenization and exclusion of special characters, (2) normalization, (3) elimination of stop words, (4) lemmatization, (5) term selection, and (6) formation of a term-document matrix. In step 1, every sentence extracted from the review comments was split into a word list, with special characters and punctuation excluded. Step 2 converted uppercase letters to lowercase. Step 3 played a pivotal role in the preprocessing process: numbers, punctuation, symbols, and stop words (e.g., “me”, “I”, “or”, “him”, “a”, and “they”) were excluded, as they “appear frequently and are insufficiently specific to represent document content (p. 976)” [22]. Step 4 analyzed the vocabulary and morphological aspects of words, with the primary objective of removing inflectional endings to obtain the root form. For instance, the lemma of “courses” is “course”, “assessing” is transformed into “assess”, and “mice” becomes “mouse”. This lemmatization process holds great importance in text mining. Another common method for reducing terms to their base form is stemming. However, stemming was not utilized, as it often collapses derivationally related words [23]. For instance, the stem of “organized” would be “organ”, while the lemma remains “organize”. In this case, stemming may lead to challenges in correctly interpreting term stems. After lemmatization, less significant terms were excluded in step 5 using TF-IDF. In step 6, we created a term-document matrix representing the occurrence of terms (rows) in documents (columns), thereby constituting the corpus for the STM algorithm.
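The six steps can be sketched as follows. This is a minimal sketch, assuming NLTK for stop-word removal and lemmatization and scikit-learn for the TF-IDF-based term selection; the thresholds (min_df, max_features) are illustrative placeholders rather than values reported in the study.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()


def preprocess(review: str) -> list:
    # Steps 1-2: tokenize, drop special characters/punctuation/numbers, lowercase
    tokens = re.findall(r"[a-z]+", review.lower())
    # Step 3: remove stop words
    tokens = [t for t in tokens if t not in stop_words]
    # Step 4: lemmatize ("courses" -> "course"); stemming is deliberately avoided
    return [lemmatizer.lemmatize(t) for t in tokens]


# Steps 5-6: TF-IDF-based term selection and the term-document matrix
# (min_df and max_features are placeholders, not values from the study)
vectorizer = TfidfVectorizer(tokenizer=preprocess, lowercase=False,
                             min_df=5, max_features=5000)
# reviews = [...]                            # helpful reviews from D1 or D2
# tdm = vectorizer.fit_transform(reviews).T  # terms as rows, documents as columns
```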

4 Methods

4.1 Creating STM models

In our analysis of MOOC reviews, we adopted the structural topic model (STM) [24] as it offers notable advancements compared to latent Dirichlet allocation (LDA). LDA and STM are both Bayesian generative frameworks employed in the field of topic modeling. They posit that a topic embodies a probability distribution over words, and each document comprises a blend of topics spanning the entire corpus [25]. However, STM presents significant enhancements by incorporating document-level structural information, which influences the prevalence of topics at the document level (i.e., per-document topic proportions) and the distribution of words within a topic (i.e., topic-word distributions). This approach allows for a more targeted examination of the influence of covariates on the text content.

Figure 1 presents a graphical representation of the technical distinctions between STM and LDA. Each variable is depicted as a node. The nodes without shading indicate latent variables, whereas the nodes with shading denote observable variables. The rectangular shapes denote replication: \(n\in \{1,2,\ldots,N\}\) signifies the indexing of words within a document; \(k\in \{1,2,\ldots,K\}\) is the indexing of each topic, based on a predefined number of topics, \(K\); and \(d\in \{1,2,\ldots,D\}\) denotes the indexing of documents. Additionally, Fig. 1 depicts that solely node \(w\) (i.e., words within documents) is directly measurable in the two models. Therefore, the primary objective lies in deducing the concealed topic information from the observable words, \(W\), and producing matrices: document-topic proportions, \(\theta\), and topic-word distributions, \(\beta\).

Fig. 1 Plate diagram comparison of LDA and STM adapted from [26]

In this study, the textual information from the helpful reviews in D1 and D2 served as the input for topic modeling. For both datasets, we generated 36 distinct STM models by varying \(K\) within the range of 5 to 50.

4.2 Exclusivity and coherence measures

This study selected models by considering exclusivity and semantic coherence metrics, alongside manual verification. Mimno et al. [27] introduced semantic coherence as a measure closely associated with pointwise mutual information [28]. This criterion reaches its peak when the most likely words within a particular topic frequently appear together. Mimno et al. demonstrated a strong correlation between semantic coherence and human evaluations of topic quality. To formalize this, consider \(D({v}_{i},{v}_{j})\) as the count of documents in which words \({v}_{i}\) and \({v}_{j}\) co-occur. The semantic coherence for topic \(k\) is given by Roberts et al. [29] as Eq. (1), where \(M\) represents the top \(M\) most likely words within topic \(k\). Each model’s aggregate coherence score is computed by determining the coherence of each topic separately and subsequently averaging these individual values.

$$C_k = \mathop \sum \limits_{i = 2}^M \mathop \sum \limits_{j = 1}^{i - 1} \log \left( {\frac{{D\left( {v_i ,v_j } \right) + 1}}{{D\left( {v_j } \right)}}} \right)$$
(1)
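For concreteness, Eq. (1) can be evaluated directly from a document-term matrix. The study computes these scores with the stm R package’s manyTopics function, as noted later in this section; the NumPy sketch below is only an illustrative re-implementation of the formula, with the matrix layout assumed.

```python
import numpy as np


def semantic_coherence(dtm: np.ndarray, top_word_ids: list) -> float:
    """Semantic coherence of one topic, following Eq. (1).

    dtm: documents x vocabulary count matrix
    top_word_ids: indices of the M most probable words of the topic,
                  ordered from most to least probable
    """
    present = (dtm > 0).astype(int)
    doc_freq = present.sum(axis=0)     # D(v_j): number of documents containing v_j
    co_occur = present.T @ present     # D(v_i, v_j): joint document counts

    score = 0.0
    for i in range(1, len(top_word_ids)):
        for j in range(i):
            v_i, v_j = top_word_ids[i], top_word_ids[j]
            score += np.log((co_occur[v_i, v_j] + 1) / doc_freq[v_j])
    return score

# A model-level score is the average of the per-topic coherences.
```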

Exclusivity measures the degree to which the primary words within a topic are unique to that specific topic and not prevalent among others. This value is essentially an average, calculated for each top word, of the probability of that word within the topic divided by the total probability of that word across all topics. The FREX metric [30] quantifies exclusivity while also considering word frequency. FREX is the weighted harmonic mean of a word’s rank in terms of exclusivity and frequency [29], as shown in Eq. (2). ECDF stands for the empirical cumulative distribution function, \(\omega\) is the weight (typically set to 0.7 to prioritize exclusivity), \(k\in K\) denotes the \(k\)th topic, \(v\) signifies the word being evaluated, and \(\beta\) refers to the topic-word distribution for that specific topic. The cumulative distribution function of a real-valued random variable \(X\), computed at \(x\), indicates the probability of \(X\) assuming a value less than or equal to \(x\); the ECDF is the corresponding distribution derived from the sampled dataset rather than the entire population.

$${\text{FREX}}_{k,v} = \left( {\frac{\omega }{{{\text{ECDF}}\left( {\beta_{k,v} /\sum_{j = 1}^K \beta_{j,v} } \right)}} + \frac{1 - \omega }{{{\text{ECDF}}\left( {\beta_{k,v} } \right)}}} \right)^{ - 1}$$
(2)
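Again, the stm package computes FREX internally; the following NumPy sketch merely evaluates Eq. (2) from a topic-word matrix \(\beta\) to make the rank-based ECDF explicit (the matrix layout is assumed).

```python
import numpy as np


def frex(beta: np.ndarray, omega: float = 0.7) -> np.ndarray:
    """FREX scores (Eq. 2) for every topic-word pair.

    beta: K x V matrix of topic-word probabilities (each row sums to 1)
    omega: weight on exclusivity, 0.7 as noted in the text
    """
    def ecdf(values: np.ndarray) -> np.ndarray:
        # empirical CDF of each entry, evaluated within its own vector
        ranks = np.argsort(np.argsort(values)) + 1
        return ranks / len(values)

    # exclusivity: a word's probability in topic k relative to all topics
    exclusivity = beta / beta.sum(axis=0, keepdims=True)

    scores = np.zeros_like(beta)
    for k in range(beta.shape[0]):
        scores[k] = 1.0 / (omega / ecdf(exclusivity[k]) +
                           (1.0 - omega) / ecdf(beta[k]))
    return scores
```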

In this study, both the coherence and exclusivity of each topic within a model were computed using the manyTopics function of the stm R package [29]. These values were then averaged across all topics to derive the model’s overall score. Models exhibiting elevated exclusivity and semantic coherence were typically favored. For D1, among the 36 estimated models, eight (K = 19, 20, 21, 22, 23, 24, 25, and 26) surpassed the others and were chosen for further comparison. Subsequently, a qualitative assessment of these eight models was performed by examining terms and reviews to identify the most suitable model, which turned out to be the one with 22 topics. Hence, based on both quantitative and qualitative evaluations, the model with 22 topics was selected as the final choice. Employing a similar approach, the optimal topic model for D2 was found to be the one with 24 topics. Figures 2 and 3 present the outcomes of model diagnosis using exclusivity and coherence for D1 and D2.

Fig. 2 Model diagnosis based on exclusivity and coherence for the D1 dataset

Fig. 3 Model diagnosis based on exclusivity and coherence for the D2 dataset

4.3 Topic representation and assignment

The findings from the optimal models on D1 and D2 were used to identify the discriminant terms (i.e., the most commonly employed words), organized in descending order of their frequency within each topic. For every topic, the topic proportion was computed as \(P_k=\left(\sum_{d}\theta_{d,k}\right)/D\), based on the matrix linking course reviews and topics in terms of proportions. In this equation, \(P_k\) represents the overall prevalence of the \(k\)th topic within the data corpus, \(\theta_{d,k}\) represents the prevalence of the \(k\)th topic in the \(d\)th course review, and \(D\) indicates the total number of course reviews.

To provide a more concise view of the topics, we needed to condense and categorize them. As indicated in Chang et al. [31], accomplishing this task necessitates human judgments and interventions to assign appropriate labels or titles to the topics. This is determined by the semantic relatedness of important terms within the topic matrix, ensuring a more accurate representation of the topic contents. In the quest to capture shared opinions in review comments, [32] identified six coding categories derived from transactional distance theory to grasp learners’ encounters in MOOCs, consisting of “structure”, “videos”, “instructors”, “content”, “interaction”, and “assessment”. In line with technology acceptance theories, Du and Gao [33] developed four primary factors, encompassing usefulness, enjoyment, technicality, and effort, as potential indicators of AI-based educational application adoption. Grounded on previous studies, this study included nine factors (i.e., “assessment”, “content”, “effort”, “enjoyment”, “faculty”, “interaction”, “structure”, “technicality”, and “usefulness”) as predetermined labels (criteria) to understand MOOC learners’ concerns. Subsequently, based on the nine criteria, along with the semantic similarity of crucial terms within the topics, two experienced experts specializing in MOOC education deduced and condensed sub-labels (sub-criteria) for each topic in the optimal models for D1 and D2. This labeling process adhered rigorously to a procedure widely adopted in prior topic modeling studies (e.g., [34, 35]). To maximize precision, this study also involved multiple rounds of in-depth discussions among the two experts until a consensus was reached. Topics that merely described the course content without offering any opinions or evaluations on its quality were omitted. To maintain consistency, the proportions (\({P}_{k}\)) of the included topics were adjusted by normalizing their entries and dividing each entry by the sum of all entries. Finally, 13 relevant topics were retained for D1, and 18 relevant topics were retained for D2.
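Computationally, the topic proportions and the subsequent adjustment of the retained topics reduce to a few lines over the model’s \(\theta\) matrix. The sketch below assumes \(\theta\) is available as a D x K NumPy array; the retained-topic indices are placeholders.

```python
import numpy as np


def topic_proportions(theta: np.ndarray) -> np.ndarray:
    """P_k = (sum_d theta_{d,k}) / D, i.e. the column means of theta."""
    return theta.mean(axis=0)


def adjusted_proportions(p: np.ndarray, retained_ids: list) -> np.ndarray:
    """Keep only the retained (evaluation-relevant) topics and renormalize
    their proportions so that they sum to one."""
    kept = p[retained_ids]
    return kept / kept.sum()


# e.g. for D1, 13 of the 22 topics were retained (indices here are placeholders):
# adjusted_d1 = adjusted_proportions(topic_proportions(theta_d1), retained_ids_d1)
```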

As a result, the nine criteria, along with their sub-criteria, were utilized to highlight learners’ major concerns. For example, this study introduced the predetermined label of “faculty” to encompass a comprehensive range of factors directly related to instructors within the MOOC context. The sub-labels under “faculty” encapsulate various aspects related to instructors, for example, presentation style, passion, humor, teaching methodologies, overall experience, and the collaborative dynamics within the faculty team. This study also introduced the predetermined label of “technicality” to encapsulate essential components shaping the quality of MOOC courses. This label was structured with two sub-labels, “complexity” and “flexibility”, each of which involves considerations regarding the role and impact of technological factors such as videos within MOOCs [33]. Specifically, within the “complexity” sub-label, we incorporated technical aspects that are central to the quality of video content in MOOCs, for example, video resolution, streaming capabilities, interactivity, and accessibility features. Furthermore, the “flexibility” sub-label was designed to address the adaptability and user-centric features of video content within MOOCs, for example, the ability of videos to cater to diverse learning styles and their responsiveness across devices.

  • “assessment”: assignment, autograder

  • “content”: problem-solving, explanation, use of cases/examples

  • “effort”: perceived difficulty, perceived workload

  • “enjoyment”: pleasure, satisfaction

  • “faculty”: instructor humor, instructor presentation, instructor passion, teaching style, instructor experience, faculty team

  • “interaction”: interaction

  • “structure”: course description

  • “technicality”: complexity, flexibility

  • “usefulness”: knowledge enhancement, job preparedness, beginner friendliness

4.4 Determining criteria, weights, and alternatives in AHP

This stage holds immense significance when facing a MOOC selection dilemma. Utilizing the outcomes of topic modeling for D1 and D2, we ascertain the criteria, sub-criteria, and their respective weights. Based on this information, two decision hierarchies can be created, as exemplified in the selection of MOOC courses for Art, Design, and Humanities as well as Computer Science, Engineering, and Programming, respectively. The weight assigned to each criterion is the total of the weights allocated to its sub-criteria (i.e., topic labels), computed as \(P_{cr}=\sum_{k\in cr}P_{k}\).

In this research, only MOOC courses with over 100 reviews and an overall rating of ≥ 4.7 stars (where 1 signifies terrible and 5 denotes excellent) were considered during the course selection process. This study chose to analyze courses with an overall rating of 4.7 or above because of the common challenge users encounter when attempting to choose among courses with marginal differences in overall ratings. Our method aims to encourage users to delve deeper into the evaluation process for the initially selected highly rated courses. Thus, we specifically focused on comparing highly rated courses across fine-grained dimensions. Also, considering the caliber of the reviews and the amount of data needed for assessing these courses, we opted for courses with a score of ≥ 4.7 stars that had garnered at least 100 reviews; this threshold ensures that each course has accumulated sufficient feedback for assessment. Accordingly, nine courses each from Art, Design, and Humanities and from Computer Science, Engineering, and Programming were used as alternatives in the AHP (Tables 2 and 3).
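The course filter itself is straightforward; a sketch assuming a pandas DataFrame of course metadata with columns “reviews” and “rating” (illustrative names, not from the study) is:

```python
import pandas as pd


def select_alternatives(courses: pd.DataFrame) -> pd.DataFrame:
    """Keep courses with at least 100 reviews and an overall rating of >= 4.7 stars."""
    mask = (courses["reviews"] >= 100) & (courses["rating"] >= 4.7)
    return courses[mask]
```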

Table 2 Appointed alternatives for Art, Design, and Humanity courses
Table 3 Appointed alternatives for Computer Science, Engineering, and Programming courses

4.5 Course ranking via AHP

Once the hierarchical structure for MOOC selection was established, the prioritization process determined the comparative importance of the MOOCs. All alternatives were evaluated by learners from Class Central, making this a collective decision-making analysis.

For each course, the aggregated score for the \(k\)th topic (sub-criterion) is calculated as \(a_k=\sum_{c}r_c p_{ck}\), where \(r_c\) represents the overall rating score provided by the \(c\)th reviewer (i.e., learner), and \(p_{ck}\) denotes the proportion of the \(c\)th review comment assigned to the \(k\)th topic. Consequently, the score for each criterion is the sum of the scores of its sub-criteria, \(a_{cr}=\sum_{k\in cr}a_k\). Subsequently, the pairwise comparison matrices were formulated from the acquired data, satisfying the reciprocal property in Eq. (3). Each pairwise comparison matrix consists of n × n elements, where n denotes the number of factors under consideration.

$$a_{ji} = \frac{1}{{a_{ij} }},\quad {\text{where}}\,i,j = 1,2,3, \ldots ,n$$
(3)

Using a nine-point scale, each element denotes the assessment of the comparative significance of two factors. Element \(a_{ij}\) signifies the extent to which the \(i\)th factor dominates the \(j\)th factor; this value is placed in the \(i\)th row and \(j\)th column, while its reciprocal is placed in the \(j\)th row and \(i\)th column. The resulting comparison matrix is shown in Eq. (4). To derive the priorities for each criterion, denoted as \(\mathrm{priorities}_{cr}\), we first compute the principal right eigenvector of each matrix and then normalize its entries by dividing each by their collective total. Afterward, the overall priority of each MOOC course can be determined by computing the weighted sum \(\sum_{cr} ({P}_{cr}\times \mathrm{priorities}_{cr})\).

$$A = \begin{pmatrix} a_{11} & 1/a_{21} & \cdots & 1/a_{n1} \\ a_{21} & a_{22} & \cdots & 1/a_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$$
(4)
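The whole ranking pipeline (criterion scores, reciprocal pairwise matrices, eigenvector priorities, and the weighted sum) can be sketched with NumPy as follows. This is a minimal sketch under the assumption that the review-level ratings and topic proportions are already available as arrays; it is not the authors’ released implementation.

```python
import numpy as np


def criterion_score(ratings: np.ndarray, topic_props: np.ndarray, sub_ids: list) -> float:
    """a_cr: sum over reviews of r_c * p_ck, summed over the criterion's topics."""
    return float((ratings[:, None] * topic_props[:, sub_ids]).sum())


def pairwise_matrix(scores: np.ndarray) -> np.ndarray:
    """Reciprocal comparison matrix with a_ij = w_i / w_j (Eqs. 3 and 4)."""
    w = np.asarray(scores, dtype=float)
    return w[:, None] / w[None, :]


def priorities(matrix: np.ndarray) -> np.ndarray:
    """Principal right eigenvector, normalized so its entries sum to one."""
    eigvals, eigvecs = np.linalg.eig(matrix)
    principal = np.abs(np.real(eigvecs[:, np.argmax(np.real(eigvals))]))
    return principal / principal.sum()


def overall_priorities(scores_by_criterion: dict, weights: dict) -> np.ndarray:
    """Weighted sum over criteria: sum_cr P_cr * priorities_cr."""
    n = len(next(iter(scores_by_criterion.values())))
    total = np.zeros(n)
    for cr, scores in scores_by_criterion.items():
        total += weights[cr] * priorities(pairwise_matrix(np.asarray(scores)))
    return total
```

Because each matrix built this way is perfectly consistent (every entry is a ratio of the same underlying scores), the eigenvector step simply recovers the normalized criterion scores, which matches the procedure described above.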

5 Results

5.1 Topic identification

The outcomes of the optimal STM models on D1 and D2 can be found in Figs. 4 and 5 and Tables 10 and 11 in the “Appendix”. From the topic modeling results for D1 and D2, we extracted the criteria, sub-criteria (referred to as effective topic labels), and adjusted weights (termed adjusted topic proportions). The eight criteria considered in selecting Art, Design, and Humanities courses are “assessment”, “content”, “effort”, “enjoyment”, “faculty”, “interaction”, “structure”, and “usefulness”. Furthermore, some criteria have sub-criteria. For instance, the criterion of “usefulness” encompasses the sub-criteria of “knowledge enhancement”, “job preparedness”, and “beginner friendliness”. As for the selection of Computer Science, Engineering, and Programming courses, nine criteria were considered: “assessment”, “content”, “effort”, “enjoyment”, “faculty”, “interaction”, “structure”, “technicality”, and “usefulness”. Similar to the previous case, certain criteria possess sub-criteria at the subsequent level. For instance, the criterion of “faculty” consists of the sub-criteria of “instructor humor”, “instructor presentation”, “instructor passion”, and “teaching style”.

Fig. 4 Extracted valid topics and their adjusted weights for D1

Fig. 5 Extracted valid topics and their adjusted weights for D2

Using the identified criteria and sub-criteria, two decision hierarchies were formulated, as illustrated in Figs. 6 and 7, for MOOC selection. Next, by referring to the adjusted weights (adjusted topic proportions) presented in Fig. 4, the weight of the criterion “usefulness” is computed as follows: 0.0569 (adjusted weight of “knowledge enhancement”) + 0.1681 (adjusted weight of “job preparedness”) + 0.0815 (adjusted weight of “beginner friendliness”) = 0.3065. For criteria without any sub-criteria, such as “interaction”, the adjusted weight of “interaction”, which is 0.0506, is used directly. The weights for other criteria are as follows: 0.1211 (“effort”), 0.0924 (“assessment”), 0.0996 (“content”), 0.2697 (“faculty”), 0.0318 (“structure”), and 0.0285 (“enjoyment”).
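As a quick check of the aggregation rule \(P_{cr}=\sum_{k\in cr}P_k\), the “usefulness” weight for D1 can be reproduced from the sub-criterion weights quoted above (the dictionary below simply restates those numbers):

```python
# Adjusted sub-criterion (topic) weights for D1, as read from Fig. 4
usefulness_subweights = {
    "knowledge enhancement": 0.0569,
    "job preparedness":      0.1681,
    "beginner friendliness": 0.0815,
}

# P_cr is the sum of the weights of its sub-criteria
print(round(sum(usefulness_subweights.values()), 4))   # 0.3065

# Criteria without sub-criteria keep their topic weight directly,
# e.g. "interaction" = 0.0506.
```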

Fig. 6 Decision hierarchy for Art, Design, and Humanities course selection

Fig. 7 Decision hierarchy for Computer Science, Engineering, and Programming course selection

Similarly, by examining the adjusted weights (adjusted topic proportions) presented in Fig. 5, the weights of the criteria are as follows: 0.1581 (“enjoyment”), 0.0792 (“structure”), 0.1963 (“faculty”), 0.1619 (“assessment”), 0.0703 (“content”), 0.0425 (“interaction”), 0.0998 (“usefulness”), 0.1403 (“effort”), and 0.0517 (“technicality”).

5.2 Course ranking

The scores for each criterion in the case of Art, Design, and Humanities courses are presented in Table 4. Eight sets of pairwise comparison matrices were utilized to evaluate and rank the quality of courses within Art, Design, and Humanities, as listed in Tables 5 and 6, and Tables 12, 13, 14, 15, 16, 17 and 18 in the “Appendix”. Let the \(n\)-by-\(n\) pairwise comparison matrix be denoted as \(A=\left({a}_{ij}\right)\), where \(i,j=\mathrm{1,2},\dots ,n\). The entry \({a}_{ij}\) is determined by \({w}_{i}/{w}_{j}\), where \({w}_{i}\) and \({w}_{j}\) represent the weights of alternatives \(i\) and \(j\) with respect to certain criteria. For the purpose of comparison, the nine courses are enumerated on the left-hand side and at the top, and an evaluation is conducted to assess the degree of dominance of courses on the left over those at the top. For instance, in Table 5, when we compare course A1 on the left with course A2 at the top with respect to the “assessment” criterion, we cross-reference Table 4, where the “assessment” ratings for A1 and A2 are 93.058 and 170.572. Consequently, we record the value 0.546 (93.058/170.572) in the cell located at the intersection of the first row and second column in Table 5, while the value 1.833 (170.572/93.058) is automatically populated in the cell at the intersection of the second row and first column in Table 5. In this scenario, it is observed that the quality of course A1, specifically regarding the “assessment” criterion on the left, does not surpass that of course A2 at the top.

Table 4 Scores for each criterion in Art, Design, and Humanity course selection
Table 5 Pairwise comparison matrix for the alternatives regarding assessment in Art, Design, and Humanity course selection
Table 6 Numerical values for ratings of the alternatives in Art, Design, and Humanity course selection

To obtain the priorities for each table, this study calculated the principal right eigenvector of each matrix and normalized its values by dividing each entry by the total sum. This allows each course to be ranked according to its priority for each criterion. The last column in Table 5 indicates the ranking positions of the nine courses for the criterion “assessment”. The ranking results were determined by the priorities for “assessment” presented in the penultimate column of Table 5. The results show that A3 is ranked at the top regarding “assessment”, followed by A2 and A5, while A7 is ranked the lowest.

In Table 6, this study determined the priorities by computing the weighted sum of the per-criterion priorities. For instance, the overall priority of A1 is calculated as follows: 0.09236552 (the weight of “assessment”) × 0.0804106 + 0.0995969 (the weight of “content”) × 0.0448651 + 0.12108072 (the weight of “effort”) × 0.07660304 + 0.02847555 (the weight of “enjoyment”) × 0.0702063 + 0.26967681 (the weight of “faculty”) × 0.24214347 + 0.05056552 (the weight of “interaction”) × 0.07088712 + 0.03175832 (the weight of “structure”) × 0.06318701 + 0.30648066 (the weight of “usefulness”) × 0.0448499 = 0.10780717. Our method ranked the alternatives for Art, Design, and Humanities course selection as presented in column 2 of Table 6. The ranking results were determined by the priorities in column 3 of Table 6, showing that A3 is ranked at the top, followed by A2 and A1, while A7 is ranked the lowest.
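The weighted sum for A1 can be reproduced directly from the numbers quoted above:

```python
# Criterion weights (P_cr) and A1's per-criterion priorities, as quoted above
weights = {"assessment": 0.09236552, "content": 0.0995969,
           "effort": 0.12108072, "enjoyment": 0.02847555,
           "faculty": 0.26967681, "interaction": 0.05056552,
           "structure": 0.03175832, "usefulness": 0.30648066}
a1_priorities = {"assessment": 0.0804106, "content": 0.0448651,
                 "effort": 0.07660304, "enjoyment": 0.0702063,
                 "faculty": 0.24214347, "interaction": 0.07088712,
                 "structure": 0.06318701, "usefulness": 0.0448499}

overall_a1 = sum(weights[c] * a1_priorities[c] for c in weights)
print(round(overall_a1, 6))   # approximately 0.107807, as reported in Table 6
```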

Analogous computations were conducted to determine the rankings of the alternatives in the selection of Computer Science, Engineering, and Programming courses, as displayed in Table 7. A total of nine pairwise comparison matrices were formulated using data from Table 8 and Tables 19, 20, 21, 22, 23, 24, 25 and 26 in the “Appendix”, aiming to evaluate the Computer Science, Engineering, and Programming courses based on the criteria of “enjoyment”, “structure”, “faculty”, “assessment”, “content”, “interaction”, “usefulness”, “effort”, and “technicality”. For example, the last column in Table 8 indicates the ranking positions of the nine courses for the criterion “enjoyment”. The ranking results were determined by the priorities for “enjoyment” presented in the penultimate column of Table 8. The results show that B3 is ranked at the top regarding “enjoyment”, followed by B8 and B7, while B6 is ranked the lowest.

Table 7 Score for each criterion in Computer Science, Engineering, and Programming course selection
Table 8 Pairwise comparison matrix for alternatives regarding enjoyment in Computer Science, Engineering, and Programming course selection

The rankings of the alternatives for Computer Science, Engineering, and Programming course selection are presented in column 2 of Table 9. The ranking results were determined by the priorities in column 3 of Table 9, showing that B3 is ranked at the top, followed by B8 and B7, while B6 is ranked the lowest.

Table 9 Numerical values for ratings of the alternatives in Computer Science, Engineering, and Programming course selection

6 Discussion

6.1 RQ1: What factors will be included in the multi-criteria decision-making framework? What are the similarities and differences between the selection of knowledge- and skill-seeking courses?

This study revealed notable distinctions between subjects. The decision hierarchies presented in Figs. 6 and 7 offer valuable insights into the factors essential for multi-criteria decision-making in course selection. For both Art, Design, and Humanities (D1) and Computer Science, Engineering, and Programming (D2) courses, eight shared criteria were considered: “assessment”, “content”, “effort”, “usefulness”, “enjoyment”, “faculty”, “interaction”, and “structure”. These criteria are widely acknowledged for their impact on MOOC learners' enrollment, satisfaction, and reuse intention [32, 36, 37].

The sole distinction lies in the inclusion of “technicality” for Computer Science, Engineering, and Programming courses. The “technicality” refers to the non-monetary cost related to the use of technology and plays a crucial role in users’ technology adoption [38]. It consists of two sub-criteria: “complexity” and “flexibility”. Complexity pertains to the ease of use of technology, wherein a user-friendly system requiring minimal physical and mental effort leads to a smoother and more satisfying learning experience. Flexibility involves the adaptability of MOOCs regarding time, space, and learning pace, enabling learners to personalize their learning experience, which enhances their satisfaction [39].

Regarding the eight shared criteria between D1 and D2 courses, inconsistencies were observed in the sub-criteria. For “assessment”, D1 only includes “assignment” as a sub-criterion, while D2 adds “autograder”. This finding supports the research by Ramesh et al. [40], highlighting the significance of auto-gradable hands-on programming assignments in scalable programming education. The “autograder”, providing automatic feedback on programming exercises, enhances learners' satisfaction and understanding of code [41].

Another disparity lies in the inclusion of “explanation” as a sub-criterion under the “content” criterion for D2, which is not present in D1. Learners in Computer Science, Engineering, and Programming courses value the clear explanation of complex concepts, making it essential for their continued learning [42]. The ability to explain such abstract scientific concepts in an accessible manner is crucial, emphasizing the importance of problem-based learning and explicit explanations to enhance participation in online courses [43].

Furthermore, an illustrative case is the consideration of instructor humor as a sub-criterion within the “faculty” category for D2, while it is absent in D1. According to Watson et al. [44], this aspect of instructors, humor, is thought to contribute to heightening the emotional aspect of social interaction, referring to the extent to which users using computer-assisted communication systems experience an emotional connection with one another [45]. Extensive literature elucidates the significant influence of both instructor presence and social presence on learners’ satisfaction with online learning (e.g., [46, 47]). In the realm of Computer Science, Engineering, and Programming courses, where abstract and enigmatic concepts often surface, instructors’ humor plays a particularly crucial role in inspiring learners to actively engage with and grasp these notions.

6.2 RQ2: What are the most influential factors in course selection? What are the similarities and differences between the selection of knowledge- and skill-seeking courses?

Among the criteria with a weight of more than 10%, “effort” and “faculty” appeared in both D1 and D2. When considering distinctions, the evaluations of the criteria “assessment”, “enjoyment”, and “usefulness” differ between D1 and D2.

6.2.1 Effort

Regarding Art, Design, and Humanities courses, the “effort” criterion contributes 12.11% to the overall evaluation process, while for the selection of Computer Science, Engineering, and Programming courses, it carries a weight of 14.03%.

The effort is inherently intertwined with the technological aspects mentioned previously. As MOOCs are designed with multifaceted functionalities to meet users’ expectations, their adaptation requires certain extents of mental and physical exertion, for example, time, energy, financial commitment, engagement, and utilization of resources. Through our topic modeling analysis, we identified perceived difficulty and perceived workload as indicators of perceived effort.

Perceived workload pertains to students’ perception of the workload and time commitment required for completing course tasks [32]. Perceived difficulty refers to a student’s perception of how easy or challenging it is to comprehend the course content [48]. Before enrolling, learners often evaluate the course description, syllabus, and any available information about the workload and course difficulty. When learners perceive the workload and course difficulty as high, they are likely to perceive the effort required to complete tasks as substantial. Consequently, if the perceived workload and course difficulty align with their expectations, available time, and resources, learners are more inclined to enroll. However, if learners anticipate a high workload and course difficulty that surpasses their capacity or conflicts with their goals, it can result in increased stress, anxiety, and reduced satisfaction, and may impact learners’ decisions to enroll in a MOOC. Our findings align with the observations in typical in-person classroom settings. Howell and Buck [49] reported a significant association between workload and student satisfaction and engagement in face-to-face learning. Nonetheless, the correlation between perceived workload and perceived effort may vary depending on individual factors, such as prior knowledge and learning preferences.

6.2.2 Faculty

Regarding the “faculty” criterion, it contributes 26.97% and 19.63% to the evaluation of Art, Design, and Humanities, and Computer Science, Engineering, and Programming courses, respectively. Our topic modeling analysis results show that within the context of MOOCs, the “faculty” criterion consists of six sub-criteria: instructor experience, faculty team, instructor passion, instructor humor, instructor presentation, and teaching style. These factors have been extensively emphasized in the literature as crucial elements in MOOCs (e.g., [6, 50, 51]).

The significance of “faculty” as a factor influencing MOOC learners’ satisfaction can be attributed to several reasons. Firstly, experienced instructors possess vast knowledge and expertise, effectively guiding students, addressing queries, and providing real-world examples. This enriches the learning experience and fosters learner satisfaction. Secondly, a cohesive and supportive faculty team creates a positive environment. MOOC platforms offer interactive tools that assist the faculty team, including instructors and teaching assistants, in problem-solving and facilitating learning inquiry [52], thereby enhancing personalized attention to students. Thirdly, passionate instructors go beyond the curriculum and inspire and motivate students. Their enthusiasm encourages exploration and deeper understanding, resulting in an engaging and enjoyable learning experience and heightened satisfaction [53]. Fourthly, the appropriate use of humor by instructors creates a relaxed atmosphere, alleviating tension, fostering rapport, and establishing a positive emotional connection with learners. These positive emotions contribute to a more satisfying learning experience. Fifthly, effective presentation skills, such as clear communication, well-organized lectures, and the skillful use of visual aids, significantly impact learner satisfaction. Instructors who deliver information coherently and engagingly facilitate the comprehension of complex concepts and maintain student interest. Additionally, accommodating diverse learning styles through adaptable teaching methods creates an inclusive and effective learning environment [54]. By catering to individual needs and employing various instructional approaches, instructors enhance learner satisfaction by promoting understanding and practical application of course material.

Overall, the “faculty” criterion encompasses various factors that directly influence instructional quality and the overall learning experience. Fulfilling these criteria enhances learner engagement, motivation, and satisfaction in MOOCs.

6.2.3 Assessment

In terms of the criterion “assessment”, its weight is 16.19% in D2, while it constitutes only 9.24% of the evaluation process in D1. In the context of MOOCs, the significance of assessment is widely acknowledged. For example, Bali [55] proposed that MOOC assessment provides learners with opportunities to evaluate their acquired knowledge and apply their learned skills. Likewise, Hew [43] emphasized the significance of interactive learning and applying knowledge to improve engagement. The correlation between motivation and evaluation is supported by Zimmerman [56], who discovered that individuals with particular learning objectives tend to assess their performance. Scholars have also observed that providing self-assessment tools to learners aids in developing critical thinking abilities [57].

Especially in Computer Science, Engineering, and Programming MOOCs, learners often discuss assessment-related matters. This aligns with the considerable interest among researchers in exploring effective assessment strategies to enhance computer science and programming education [58]. Although auto-graders in MOOCs have improved fairness and efficiency, challenges remain, such as potentially inaccurate auto-grading and a lack of personalization. To optimize programming education, it is suggested to support inline-anchored discussions addressing auto-grader complaints and integrate program visualization tools that demonstrate how auto-graders work into forums. Additionally, the assessment features characteristic of MOOCs, such as assignments and quizzes, still fall short of meeting the requirements of effective computer science and programming instruction. To foster proficient programming skills, learners must receive timely and accurate feedback on their solutions and answers after completing practical programming exercises and real-world programming tasks.

6.2.4 Enjoyment

Concerning the criterion “enjoyment”, its weight is 15.81% in D2, while it constitutes only 2.85% of the evaluation process in D1. Cognitive Evaluation Theory suggests that individuals’ actions are shaped by internal components, emphasizing the joy and contentment derived from engaging in a behavior [59]. Enjoyment, referring to “the extent to which the activity of using a product is perceived to be enjoyable in its own right, apart from any performance consequences that may be anticipated (p. 1113)” [60], is considered an inherent and emotional concept that impacts acceptance. This concept comprises two sub-components, pleasure and satisfaction, which gauge learners’ internal drive across various levels of necessities.

Pleasure represents the initial level of positive mental states learners experience in learning [33]. Although MOOCs are not primarily for pleasure-driven intentions, learners may find it interesting, thus motivating continued adoption. Satisfaction is an emotional reaction resulting from achieving a certain objective after usage [61]. When learners are satisfied with the learning outcomes from a MOOC, they have higher intentions for continued use. Howarth et al. [62] found a positive correlation between current learners’ satisfaction and technology adoption by new learners.

In Computer Science, Engineering, and Programming MOOCs, enjoyment plays a pivotal role for two reasons. First, these fields demand high levels of cognitive effort and problem-solving skills. When learners find the activities and learning experiences intrinsically enjoyable, it enhances their motivation to engage with the MOOCs. Enjoyment sustains their interest, curiosity, and enthusiasm, fostering higher levels of engagement and adoption. Second, Computer Science, Engineering, and Programming involve complex concepts, abstract thinking, and technical skills. The learning process can be demanding and require significant effort. Enjoyment serves as a positive emotional response that helps alleviate the perceived difficulty and challenges associated with these subjects. When learners derive joy and pleasure from the learning activities, they are more likely to persist through challenges, remain motivated, and continue adopting MOOCs.

6.2.5 Usefulness

Regarding usefulness, it constitutes 30.65% of the evaluation process in D1, while accounting for only 9.98% in D2. Usefulness, characterized as an external and mental concept, refers to the total worth a user attributes to employing a new technology [38]. Based on our topic modeling results, in the context of MOOCs, usefulness has three sub-criteria: knowledge enhancement, job preparedness, and beginner friendliness. Among these, job preparedness is exclusively present in D1, making up 16.81% of the evaluation process. This somewhat supports [63], who identified job promotion needs and curiosity as primary factors influencing learners’ enrollment in a MOOC. However, motives for MOOC enrollment can vary across different course types. For example, according to a survey by Bayeck [64], 74.6% and 11.9% of respondents enrolled in humanities courses (poetry and music, respectively) due to curiosity and seeking job performance improvement. For MOOCs related to science, health science, and math, only 39% of enrollments were motivated by skill improvement to perform better in jobs.

For learners taking Art, Design, and Humanities MOOCs, potential explanations for their frequent concerns about job preparedness are as follows. Firstly, learners in Art, Design, and Humanities disciplines often prioritize acquiring practical skills that directly apply to their work. They seek assurance that the courses they undertake will equip them with relevant and applicable skills sought after in the job market. It is essential to connect theoretical understanding with real-world implementation, ensuring that the courses they invest time and effort in will have a tangible impact on their professional growth. Secondly, the Art, Design, and Humanities industries are diverse and encompass various sectors, each with its unique requirements and expectations. Learners are concerned about whether the MOOCs they choose align with their career interests and goals. They want to ensure that the course content, projects, and assignments offered reflect current industry practices. The relevance of the course content to their chosen field is crucial for learners to feel confident in job preparedness.

6.3 RQ3: What are the ratings of courses and how do they rank according to sub-criteria and criteria?

The overall rankings of course alternatives in selecting Art, Design, and Humanities as well as Computer Science, Engineering, and Programming courses, as presented in Tables 6 and 9, do not precisely match the overall course ratings from the Class Central website (Tables 2 and 3). There are similarities and differences between them. For instance, according to the reviews’ overall ratings about Art, Design, and Humanities courses in column 3 of Table 2, A2, A5, and A8 are rated as 4.7, which is the lowest among the nine courses. However, our proposed model, as shown in column 2 of Table 6, ranks A5 and A8 in the seventh and eighth positions, respectively, while it places A2 in the second position. Similarly, for the Computer Science, Engineering, and Programming courses, based on the reviews’ overall ratings in column 3 of Table 3, B2 and B8 receive the highest rating of 4.9 among the nine courses. Nevertheless, our proposed model, displayed in column 2 of Table 9, ranks B8 in the second position and B2 in the sixth position.

The inconsistencies between our analysis results and the reviewers’ overall opinions about the courses may arise from the fact that our approach goes beyond merely considering the rough scores given by individual learners. Instead, we employ a more detailed and fine-grained approach by also considering the weights and significance attached to each criterion, determined through group decision-making analysis. This allows us to gain insights into a single course’s rankings based on different criteria and even sub-criteria, which are not available on the Class Central website. The results presented in Table 5 and Tables 12, 13, 14, 15, 16, 17 and 18 in the “Appendix” for Art, Design, and Humanities course selection, and in Table 8 and Tables 19, 20, 21, 22, 23, 24, 25 and 26 in the “Appendix” for Computer Science, Engineering, and Programming course selection, provide a more comprehensive understanding of how courses perform according to various criteria and sub-criteria. In contrast, the overall course ratings presented in Tables 2 and 3 treat all courses with the same rating as equal.

For example, in the case of Art, Design, and Humanities, we ranked “English for the Workplace” (A3) as the best course. By checking Table 5 and Tables 12, 13, 14, 15, 16, 17 and 18 in the “Appendix”, we find that A3 is also ranked the first among all alternatives according to four criteria: “assessment”, “content”, “effort”, and “usefulness”. The weights of these four criteria, from Table 6, are approximately 9.24%, 9.96%, 12.10%, and 30.64%, respectively, collectively constituting about 61.95% of the evaluation process. Therefore, we ranked “English for the Workplace” in the first place. For the other four criteria: “enjoyment”, “faculty”, “interaction”, and “structure”, the courses ranked the first among all alternatives were “English in Early Childhood: Language Learning and Development” (A4), “A Life of Happiness and Fulfillment” (A1), and “Academic Writing” (A2), respectively. Although both A1 and A4 are rated 4.7 on the Class Central website, our analysis shows that they perform differently based on different evaluation criteria. By having this knowledge, learners can select courses with better performance in the criteria they value the most. Therefore, implementing the AHP enables us to distinguish between courses that have the same overall course rating obtained from the Class Central website.

6.4 Implications and suggestions

The results of this investigation illuminate the factors that influence the selection of MOOC courses and offer valuable perspectives for instructors and researchers to optimize course design, instructor involvement, and learner satisfaction in MOOCs.

  1. Course design and presentation

    Instructors of MOOCs should create courses that go beyond being informative and are also captivating, flexible, and well-structured. Moreover, for skill-oriented courses, integrating technical aspects like user-friendliness and adaptability is crucial to enhance learners’ experiences.

  2. Instructor qualities and engagement

    Instructors should cultivate a positive learning environment and motivate learners through their expertise, experience, and enthusiasm. Incorporating humor into instruction can enhance social presence and emotional connection. They should deliver clear, well-organized lectures, effectively utilize visual aids, and embrace adaptable teaching methods catering to diverse learning styles.

  3. Assessment strategies

    Instructors need to carefully consider assessment methods, particularly in skill-oriented courses where auto-graders play a vital role in providing timely feedback and increasing satisfaction. However, they must address potential issues of inaccurate grading and lack of personalization. Introducing program visualization tools and fostering discussions can be beneficial. Assessments should promote active learning and self-evaluation to foster critical thinking skills.

  4. Promoting enjoyment

    Instructors should prioritize creating enjoyable learning experiences, especially in skill-oriented MOOCs like Computer Science, Engineering, and Programming, which can be challenging. Enjoyment positively impacts learners’ emotions, reduces perceived difficulty, and encourages engagement. Interactive and engaging activities enhance motivation and satisfaction.

  5. Addressing job preparedness

    In knowledge-oriented MOOCs like Art, Design, and Humanities, addressing learners’ concerns about job preparedness is crucial. Instructors should ensure courses offer practical and applicable skills aligned with learners’ career interests. Demonstrating the relevance of course content to specific industries and job requirements boosts learners’ confidence and satisfaction.

  6. Enhancing learner engagement

    MOOC platforms should provide personalized course recommendations based on individual preferences and learning styles to enhance engagement. Tailoring courses to meet learners’ valued criteria improves satisfaction and retention. Additionally, allowing learners to evaluate courses based on different criteria facilitates informed decision-making.

  7. Leveraging multi-criteria decision-making

    Researchers and MOOC platforms can employ the multi-criteria decision-making methodologies introduced in this study for comprehensive course evaluations. By considering overall course ratings and individual criteria weights, learners can make informed decisions aligned with their preferences. Researchers can refine the model by adding factors to improve evaluations.

By implementing these insights and recommendations, educators and online learning platforms can enhance the effectiveness and appeal of MOOC experiences, thereby promoting the advancement and enhancement of online education.

6.5 Reflections on research methodologies

This study used courses with an overall rating score of 4.7 or higher, a choice based on observed user behavior on real-world online platforms: users typically gravitate towards courses with higher overall ratings when making selections. The challenge lies in the limited information displayed, since often only the overall course ratings are visible on the MOOC website. Despite the presence of numerous textual reviews from previous learners, manually evaluating each comment is time-consuming. Additionally, learners have diverse priorities, some favoring assessment aspects while others value interaction. This diversity makes it difficult for users to objectively evaluate and select courses based on the available information. Our method aims to encourage users to delve deeper into the evaluation of the initially selected highly rated courses. By providing more granular evaluation criteria and insights into various performance dimensions, learners can identify courses that align closely with their preferences. We therefore focused on comparative analysis among courses with high overall rating scores across different dimensions, mirroring real-world user decision-making scenarios. This is more practical and meaningful, and aligns more closely with user needs during course selection, particularly for popular courses exhibiting minimal differences in overall ratings. Nevertheless, it would be worthwhile in future work to include courses spanning a wider range of overall rating scores for comparison.

Regarding the predetermined labels used to guide the labeling of topics, although we did not create standalone labels for “videos” and “instructors”, which are frequently used in previous studies (e.g., [2, 32]), we covered the dimensions pertinent to instructors and videos within the “technicality” and “faculty” labels. Specifically, the “technicality” label, with its comprehensive sub-labels, was designed to encompass and evaluate the essential technological factors in MOOCs, including video content. By analyzing the technical intricacies (complexity) and adaptive features (flexibility) of MOOC videos, we ensured a holistic evaluation that inherently included the pivotal role and influence of video content on the overall quality of MOOC courses. Similarly, the “faculty” label, with its detailed sub-labels, was used to thoroughly cover the multiple dimensions associated with instructors in MOOC courses. Our approach aimed to holistically evaluate the influence of instructors by dissecting the aspects that collectively shape the perception of faculty quality in the MOOC setting, including instructors’ effectiveness, teaching styles, and collaborative dynamics within the teaching team, thereby providing a comprehensive evaluation. Nevertheless, future work can consider more fine-grained methods for determining predetermined labels, for example, the Delphi method, which involves multiple rounds of questionnaire surveys administered to expert groups.

Regarding the manual analysis of the algorithm-identified topics, this is a prevalent method in topic modeling studies (e.g., [16, 65]), and this study rigorously adhered to a widely adopted procedure from previous studies (e.g., [34, 35]). Specifically, the interpretation of topics was guided by sociotechnical system theory, Moore’s transactional distance theory, and MOOC instructional design practices. To enhance precision, our methodology entailed multiple rounds of in-depth discussion between two experienced MOOC education experts until consensus was reached. While previous topic modeling studies (e.g., [35, 66]) generally consider two annotators sufficient to ensure accuracy and reliability in the labeling process, we acknowledge that involving more annotators could increase rigor. Moreover, future work might consider integrating STM outcomes with automated methods for topic naming, thereby improving the accuracy and efficiency of the interpretative process.

Regarding the determination of models based on exclusivity and coherence scores across a range of 5 to 50 topics, this approach might appear simplistic for finding the most suitable number of topics. However, exclusivity and coherence are widely regarded as effective measures for determining the ideal number of topics in previous topic modeling studies (e.g., [65, 67]). These measures serve as heuristics, providing indicators or scoring systems that help evaluate the quality of the topics generated by models; they aid in selecting an optimal number of topics by assessing the semantic coherence within topics and the distinctiveness between them [68]. Nevertheless, future work can consider combining more advanced heuristic-driven methodologies, such as perplexity metrics, and potentially exploring Bayesian approaches to achieve a more refined and accurate identification of the ideal number of topics.
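As a minimal sketch of this selection heuristic, the Python snippet below sweeps candidate topic numbers and reports a coherence score and a simplified exclusivity proxy for each. It uses gensim’s LDA only as a stand-in for the structural topic models actually fitted in this study, and the exclusivity function shown is an illustrative proxy rather than the exact score reported in the paper.

```python
# Illustrative sketch: sweep candidate topic numbers and score each model.
# gensim's LDA stands in for STM; exclusivity() is a simplified proxy
# (mean share of each top word's probability mass owned by its topic).
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

texts = [["course", "video", "clear", "useful"],
         ["assessment", "quiz", "feedback", "useful"],
         ["instructor", "engaging", "video", "clear"]]   # toy tokenized reviews

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

def exclusivity(lda, top_n=5):
    phi = lda.get_topics()                      # (num_topics, vocab_size)
    scores = []
    for k in range(phi.shape[0]):
        top = np.argsort(phi[k])[::-1][:top_n]  # indices of top words
        scores.append(np.mean(phi[k, top] / phi[:, top].sum(axis=0)))
    return float(np.mean(scores))

for k in range(2, 5):                           # the study swept 5 to 50
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
    coherence = CoherenceModel(model=lda, corpus=corpus,
                               coherence="u_mass").get_coherence()
    print(k, round(coherence, 3), round(exclusivity(lda), 3))
```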

Regarding computational complexity, optimizing topic models entails refining parameters and selecting an appropriate model configuration. Specifically, a substantial amount of computing resources is devoted to determining the most suitable models based on exclusivity and coherence scores across a range of topic numbers (from 5 to 50) within the proposed method, requiring hours to process databases D1 and D2. Computing exclusivity and coherence scores for various topic numbers involves running multiple iterations of the topic modeling algorithm for each specified number of topics. This iterative process is computationally intensive compared with other steps in the proposed method, such as the automatic ranking of a sample of courses across various criteria, which completes within seconds. Nevertheless, the overall computational time for the proposed method is much shorter than that of manual coding. However, as dataset sizes expand, encompassing larger vocabularies and broader ranges of topics, the time and resources for model training and inference will also increase, potentially raising scalability concerns. Future endeavors may therefore consider optimizing the scoring algorithms or adopting more computationally efficient approaches for computing exclusivity and coherence scores to alleviate the computational burden of the overall modeling process.
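One possible mitigation, sketched below under the assumption that model fits for different topic numbers are independent, is to distribute the sweep across worker processes. Here fit_and_score is a hypothetical helper standing in for fitting a model with k topics and returning its scores (as in the previous sketch); this parallelization is an illustration, not part of the published pipeline.

```python
# Illustrative only: parallelize the sweep over candidate topic numbers so
# that several models are fitted at once.
from concurrent.futures import ProcessPoolExecutor

def fit_and_score(k):
    # In practice: fit the topic model on the review corpus with k topics
    # and return its (exclusivity, coherence) scores; a trivial placeholder
    # keeps this sketch runnable on its own.
    return (1.0 / k, -float(k))

if __name__ == "__main__":
    ks = list(range(5, 51))                     # the range used in the study
    with ProcessPoolExecutor(max_workers=4) as pool:
        scores = dict(zip(ks, pool.map(fit_and_score, ks)))
    print(scores[5], scores[50])
```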

Overall, the proposed methodology enables effective and efficient analysis of extensive textual data, in stark contrast to manual analysis methods. We recognize its limitations compared with a “true close reading” of texts. Even so, automated text analysis offers tremendous potential for educational researchers grappling with vast volumes of data, facilitating the exploration of MOOC learner-generated texts.

6.6 Limitations and future work

This study has limitations. Specifically, the review data was exclusively obtained from Class Central, a well-known MOOC aggregator website. While Class Central is esteemed, incorporating data from diverse MOOC platforms such as edX, Coursera, and self-developed online learning platforms, and even open-ended MOOC surveys of learners and instructors, would offer additional validation and complement the study’s findings. The present study also concentrated solely on Art, Design, and Humanities as well as Computer Science, Engineering, and Programming MOOCs as representatives of knowledge- and skill-seeking courses. Since courses from different disciplines may display unique characteristics, it is imperative to include MOOCs from a broader spectrum of disciplines, thereby enhancing the applicability and generalizability of the findings and augmenting the model’s effectiveness.

7 Conclusion

This study introduced a novel approach to automatically evaluating MOOCs using a multi-criteria decision-making model that combines text mining and the AHP. The research analyzed reviews from both knowledge-seeking MOOCs, such as Art, Design, and Humanities courses, and skill-seeking MOOCs, such as Computer Science, Engineering, and Programming courses. Criteria prioritized by both types of learners include “assessment”, “content”, “effort”, “usefulness”, “enjoyment”, “faculty”, “interaction”, and “structure”. Skill-seeking learners additionally emphasized “technicality”, encompassing ease of use and flexibility. For both knowledge-seeking and skill-seeking learners, “effort” and “faculty” were deemed crucial. However, skill-seeking learners placed more importance on “assessment” and “enjoyment”, while knowledge-seeking learners valued “usefulness” more highly. The research demonstrated the effectiveness of employing online learner reviews and topic modeling for automated MOOC evaluation.

Several issues in the research design of this study should be noted: (1) the use of data related only to Art, Design, and Humanities and Computer Science, Engineering, and Programming courses with an overall rating score of 4.7 or higher from Class Central; (2) the determination of predetermined labels based on a review of the literature; (3) the manual analysis of the algorithm-identified topics; and (4) the determination of models based on exclusivity and coherence scores across a range of 5 to 50 topics. Future work may address the following issues: first, including a larger range of overall rating scores for courses from different disciplines and diverse MOOC platforms; second, proposing more fine-grained methods, such as the Delphi method, to determine predetermined labels; third, integrating STM outcomes with automated methods for topic naming; and fourth, combining more advanced heuristic-driven methodologies, such as perplexity metrics, and potentially exploring Bayesian approaches to determine the ideal number of topics. Additionally, it is vital to acknowledge that latent topics within course reviews may evolve over time. Future research could therefore focus on developing tools capable of providing real-time analyses of the topics relevant to MOOC learners.

In sum, the proposed automated text analysis methodology offers educational researchers tremendous potential for effective and efficient analysis of vast volumes of textual data, shedding light on course evaluation and selection. The findings have significant implications for optimizing course design, enhancing instructor engagement, and improving learner satisfaction. MOOC platforms can leverage this model to provide personalized course recommendations, ultimately contributing to the advancement of online education and enhancing the MOOC learning experience for learners.