1 Introduction

With the spread of COVID-19, the teaching and learning format has changed from face-to-face to an online one. For this reason, massive open online courses (MOOCs) have received increased attention (Mays et al., 2021), especially in universities. MOOCs are mostly free higher education courses delivered through the Internet and multimedia tools. Through MOOCs, thousands of students can freely learn and share the courses they are interested in. The three major MOOC platforms are Coursera, edX and Udacity. In the Chinese-speaking community, Xuetang is the most popular MOOC platform (Hou, 2016). In general, MOOC platforms include three components: course content, community building tools, and platform tools. Of particular interest to this study is the course content component, which includes informational and interactive assets (Fauvel & Yu, 2016).

Despite the tremendous success of MOOCs, a number of issues are associated with their implementation. The most frequently cited are high dropout rates, low participation, and the low quality of some MOOCs (AlQaidoom & Shah, 2020). To promote MOOCs, these issues need to be identified and addressed as early as possible.

In MOOCs, teachers are separated from students and students are separated from each other. For this reason, their online communication and interaction are very important for a course's success (AlQaidoom & Shah, 2020). Constructivist learning theory holds that knowledge is not directly taught by teachers but is constructed by students during in-depth discussion with teachers and/or other students (Moon et al., 2021). Sociological theories state that learning is by nature an interactive process. Therefore, interaction among MOOC participants is important: efficient interaction not only promotes the acquisition of new knowledge but also raises the level of interaction itself (Tokur & Akgün, 2021). Research shows that online discussion is the core of online courses and can arouse students' positive perceptions of and attitudes toward learning in online mode (Guo et al., 2021).

In studies of online discussion, scholars have typically used research methods such as content analysis or questionnaires, usually grounded in a specific framework, e.g. cognitive level, knowledge construction (Duruk et al., 2021; Hwang et al., 2018), critical thinking (Warsah et al., 2021) or community of inquiry (Garrison, 1991). Content analysis has often been carried out manually, requiring considerable labor and money, and only part of the data could be included in the analysis. Questionnaires are easier to administer, are usually arranged at the end of a study, and aim to evaluate students' perceptions of their thinking level in online discussions.

Advances in technology, especially data mining, make it possible to automate data processing on MOOCs. Such an approach to analyzing the discussion content of MOOCs has attracted the attention of many researchers. In particular, a research area that has gained significant popularity is how artificial intelligence and data mining can contribute to a better understanding of the MOOC ecosystem and to improving it.

For example, Almatrafi et al. (2018) built a predictive model based on machine learning algorithms to identify urgent discussion threads. This model can help teachers effectively navigate a bulletin board system (BBS) and assess posts by priority, which supports teaching and learning and helps reduce students' dropout rate. Qin (2017) divided discussion posts into categories such as "confusion," "explanation" and "irrelevant," and trained a support vector machine (SVM) classification model to predict them. The results of these studies demonstrated that applications of data mining are beneficial for MOOC teaching and learning. Therefore, building on this research, this paper intends to establish a mathematical model for the automatic prediction of in-depth discussion of learning topics in MOOC forums. The prediction model distinguishes the content of the MOOC discussion forum, provides teachers with guidance on forum content, and encourages students to think deeply about assigned problems.

More specifically, the research problem in this paper covers the following two aspects:

  (1) To construct an automatic prediction model of the depth of discussion of learning topics;

  (2) To explore the predictor variables of the depth of discussion of learning topics.

2 The influencing factors of the depth of online discussion

Online discussion is a complex learning activity based on online interaction and involves many influencing factors. Mazzolini and Maddison (2007) analyzed discussion posts in 400 course forums and found that the teacher participation rate, the time of posting (i.e. during or at the end of the forum), and the nature of posting (i.e. question, answer, or a combination of both) influenced students' participation in online discussion. From the perspective of questions, Zhang and Si (2010) found that the speed of feedback to a question, the type of question (cognitive-memory, convergent, or divergent), and the presentation of the question are all important factors in online discussion.

As for the depth of online discussion, Huang (2017) argued that the type of topic (related or unrelated to teaching content), the number of postings and the size of groups were important variables influencing how deeply participants discussed learning content. Spatariu et al. (2007) obtained similar results, noting that the depth of online discussion can be influenced by information type, cooperative learning structure and type (e.g. structured or not, role playing, etc.) and open questions. In addition, teaching assistants, learning support (e.g. a designated discussion facilitator, facilitator questioning skills training, etc.) and group structure had impacts on students' critical thinking and their discussion depth. The depth of students' online discussion also depends on whether teachers provide feedback in time (Nandi et al., 2012) and on teachers' opinions of and attitudes toward online discussion activities (i.e. the importance they attach to online discussion) (Gerber et al., 2005). Furthermore, scholars have suggested that questions targeting high cognitive levels promote in-depth discussion online (Ertmer et al., 2011).

In summary, the influencing factors of the depth of online discussion are active participation of students, timely feedback from teachers, online interactive guidance, an open discussion environment, discussion time limits, and challenging discussion tasks. These variables also determine learners' initiative to participate in online discussion (Choi et al., 2005). Therefore, this study selects research variables according to the above literature.

3 Learning topics and discussion depth in MOOC

Communication and active participation are two crucial determinants of the quality of teaching and learning (Durairaj & Umar, 2015), especially in online learning environments. According to social development theory (Vygotsky, 1987), the cognitive development of a learner depends on the social environment in which he/she engages in the knowledge construction process through interaction with peers and the instructor. These interactions facilitate the acquisition, exchange and joint creation of knowledge as well as the sharing of experiences in instructor-learner and learner-learner relationships (Tan, 2017).

MOOCs provide students with opportunities to engage in discussion, such as talking about current coursework, asking questions about content they are unclear about, or offering assistance to students with lower learning ability. In addition, the lecturer and teaching assistant visit the discussion forum to answer questions from students and provide feedback. Therefore, it is important for students to hold active discussions on various learning topics in the MOOC forum.

Learning topics in a discussion forum may vary; for example, participants can discuss different types of questions about their MOOC, important concepts, or other participants' responses or feedback. Some of these topics are ephemeral, others are long-lasting. Depth of discussion refers to the level of online discussion, such as a very thorough discussion of a topic in which most of its aspects are considered. Topic discussion is considered as the process of knowledge generation through the exchange of ideas related to a topic. In deep discussion, students need to clearly understand a topic and the associated problem, and be able to critically check the logic of arguments (Wang, 2013). The depth of online discussion in this study is defined as students' integration of new knowledge into their original cognitive structure during MOOC discussion. On this basis, new and existing knowledge is integrated and continuously applied to new learning situations; this process results in a solution to assigned problems and promotes new knowledge construction. We classified the depth of learning topic discussion in the MOOC forum based on Bloom's taxonomy (Anderson & Krathwohl, 2001), whose cognitive process dimensions are remember, understand, apply, analyze, evaluate and create. Accordingly, the present study ranked topic discussion level from low to high using the categories shallow, medium and deep. The topic discussion depth in the MOOC forum is shown in Table 1.

Table 1 Topic discussion depth in MOOC forum based on Bloom’s taxonomy (Anderson & Krathwohl, 2001)

Based on the review of above-mentioned literature, this study intends to establish a mathematical model for automatic prediction of the depth of discussion of learning topics in MOOC forum.

4 Related work on learning prediction model

4.1 Feature selection

To apply machine learning methods, data generally need to be obtained from realistic scenarios, appropriate features need to be extracted from the data, and these features need to be represented as feature vectors. The original data set typically includes many redundant or irrelevant features, which increase the dimensionality of the data and the complexity of the algorithm. Feature selection is a process that finds the most effective features in the original feature set, reduces the dimensionality of the feature set and improves classification accuracy, under the premise of preserving the invariance of similar samples, the distinguishability of different samples and robustness to noise (He, 2015). Researchers have used various approaches to feature selection. For example, Dash and Liu (1997) proposed a basic framework for feature selection composed of four steps: generation, evaluation, stopping criterion and validation. In our study, we propose a model to predict the discussion depth of learning topics in MOOC forums based on this framework.
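For illustration, a minimal Python sketch of a filter-style selection routine organized around these four steps might look as follows; the DataFrame, column names, threshold and validation model are hypothetical and are not taken from the present study:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def select_features(df: pd.DataFrame, target: str, threshold: float = 0.1):
    """Filter-style feature selection following the generation -> evaluation ->
    stopping criterion -> validation framework of Dash and Liu (1997)."""
    candidates = [c for c in df.columns if c != target]              # generation
    scores = {c: abs(df[c].corr(df[target])) for c in candidates}    # evaluation
    selected = [c for c, s in scores.items() if s >= threshold]      # stopping criterion
    # validation: check that the reduced feature set still predicts the target
    model = LogisticRegression(max_iter=1000)
    acc = cross_val_score(model, df[selected], df[target], cv=5).mean()
    return selected, acc
```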

4.2 Logistic regression algorithm

The prediction of topic discussion depth in MOOC forums can be regarded as a classification problem based on machine learning. The logistic regression algorithm is a generalized linear regression analysis method that yields a weight for each independent variable and thus makes each attribute interpretable. The logistic regression algorithm supports both two-class and multi-class classification problems. For two-class classification problems, the formulas are as follows (Eqs. 1 and 2).

$$P(Y = 1|X) = \frac{\exp (w \cdot x)}{{1 + \exp (w \cdot x)}}$$
(1)
$$P(Y = 0|X) = \frac{1}{1 + \exp (w \cdot x)}$$
(2)

\(x = (x_{1}, x_{2}, \cdots, x_{d}, 1) \in R^{d + 1}\) is the input feature vector, d is the feature dimension of the sample, \(y \in \{0, 1\}\) is the output target label, \(w = (w_{1}, w_{2}, \cdots, w_{d}, b) \in R^{d + 1}\) is the weight vector and b is the offset.

The logistic regression algorithm estimates the coefficients of the regression equation by maximum likelihood estimation, which can be solved by Newton's iterative method or gradient descent. The loss function is shown in Eq. 3.

$$L(w,x,y) = \log (1 + \exp ( - yw^{T} x))$$
(3)

The goal of the logistic regression algorithm is to find the optimal weight vector w according to the sample distribution of the training data. Given a new sample point x, the logistic regression prediction is made with the sigmoid function in Eq. 4.

$$f(z) = \frac{1}{{1 + e^{ - z} }}$$
(4)

Here, \(z = w^{T} x\). The output of the logistic regression algorithm, \(f(w^{T} x)\), has a probabilistic interpretation. By default, the sample is predicted to be the positive class when \(f(w^{T} x) > 0.5\), and the negative class otherwise.
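For the two-class case, a minimal sketch of this decision rule (Eq. 4 with the 0.5 threshold), assuming a weight vector that has already been estimated, could be:

```python
import numpy as np

def sigmoid(z):
    """f(z) = 1 / (1 + e^(-z)), Eq. 4."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_binary(w, x):
    """Predict the positive class when f(w^T x) > 0.5; w and x include the bias term."""
    p = sigmoid(np.dot(w, x))
    return int(p > 0.5), p

# Hypothetical example: 3 features plus the constant 1 appended for the offset b.
w = np.array([0.8, -0.3, 0.5, 0.1])
x = np.array([1.2, 0.4, 2.0, 1.0])
label, prob = predict_binary(w, x)
```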

4.3 Model prediction testing

The classification ability of the model on the test data is usually measured by accuracy, precision, recall and the F value, which are calculated from the confusion matrix shown in Table 2.

Table 2 Predicted class matrix

The accuracy is the proportion of samples correctly predicted by the model among all samples; it can be calculated by Eq. 5. The precision is the proportion of samples that are actually positive among the samples predicted to be positive; it can be calculated by Eq. 6. The recall is the proportion of samples predicted to be positive by the model among the actually positive samples; it can be calculated by Eq. 7. Precision and recall constrain each other, and a single criterion cannot reflect the validity of a classification model. The F value makes up for this by taking the weighted harmonic mean of precision and recall, as shown in Eq. 8.

$$Accuracy = \frac{TP + TN}{C}$$
(5)
$$Precision = \frac{TP}{{TP + FP}}$$
(6)
$$Recall = \frac{TP}{TP + FN}$$
(7)
$$F1 = \frac{2 \times Recall \times Precision}{Recall + Precision}$$
(8)
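A minimal sketch of these computations (Eqs. 5–8), assuming binary true and predicted labels and using scikit-learn's standard metric functions, is given below; the label lists are hypothetical:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # the matrix in Table 2
accuracy = accuracy_score(y_true, y_pred)                  # (TP + TN) / C, Eq. 5
precision = precision_score(y_true, y_pred)                # TP / (TP + FP), Eq. 6
recall = recall_score(y_true, y_pred)                      # TP / (TP + FN), Eq. 7
f1 = f1_score(y_true, y_pred)                              # Eq. 8
```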

4.4 Forum posts and prediction model

Posts appear as discourse in the discussion forum, and information exchange and knowledge construction are based on series of participant discussion behaviors. There is a large body of research on forum data for teaching and learning. For example, Garrison et al. (1999) developed a coding framework for learning reflection in online discussion. Babori (2021) focused on exploring learning strategies in MOOCs. Hew and Cheung (2008) compared different types of questions in discussion forums and found that questions about viewpoints were regarded as a more useful technique for encouraging participation than questions for clarification. Bradley et al. (2008) examined how question types influence the quantity and quality of discussion and determined that the limited focal question type, followed by brainstorming, open focal (an issue with no alternatives) and direct link (such as a quotation) question types, was the most influential for increasing word count and the degree of answer completion.

As such data accumulate, predictive models have also been studied. A predictive model describes the quantitative relationship between variables in mathematical language or formulas and is used to predict those relationships. It reveals the inner regularity of the variables to a certain extent and serves as the direct basis for computing predicted values. Among these approaches, regression prediction uses the correlation between independent and dependent variables to predict relationships among them. Therefore, on the basis of the above works, a learning outcome prediction model was established in the present study based on data such as posts in the MOOC discussion forum and their different types.

5 Data preparation

5.1 Data source

Currently, there is no authoritative standard data set available for research on mining posts related to learning topics in MOOC forums. We collected posts on learning topics from a MOOC platform for Chinese universities, a platform for online course learning with a large amount of data. For this reason, a statement about the ethics issues relevant to the research and the approvals under which the data were collected and reported is not required for the present study.

The discussion area of the platform generally comprises three sections: a classroom communication area, a comprehensive discussion area, and a question and answer area for teachers, each with a different function. Students enter the corresponding discussion board to ask questions according to their own needs. Other students and teachers discuss and communicate with each other around the discussion questions and arrive at solutions to the assigned problems. Of the 200 courses of this type, the Modern Educational Technology course offered by the Fu Gangshan team at Shaanxi Normal University in China has the highest number of participants. Moreover, this course has been recognized as a Chinese national top-quality course with rich topics, wide coverage, high learner enthusiasm and good representativeness.

The course has been delivered seven times since February 2016. Among these deliveries, the third one (the third session) was the most complete, covering course learning, course management, platform operation and daily communication. Therefore, the data from the third session were used as the research sample. A total of 11,184 participants took part in the third session and 3,680 discussion posts were posted. The learning-related content in the forum mainly consists of three aspects: questions, knowledge, and feedback. Questions include the type of questions and the definition of questions; knowledge consists of knowledge types and the importance of knowledge points; and feedback covers the duration of questions, the number of questions answered, the number of views, the number of votes, question–answer feedback, and whether the teacher/teaching assistant participated in the discussion. Among them, discussion with question solving is mainly reflected in question–answer feedback. These variables are taken as independent variables, and the dependent variable is the discussion depth.

An application using Selenium with Python was developed to capture all posts in the discussion forum of the Modern Educational Technology course. The collected metadata included the topic title, topic description, topic comment content, comment reply content, topic creation time, last comment reply time, number of topic replies, number of topic views, number of topic votes, topic initiator ID, and comment reply ID. All information was stored in a data sheet.
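A simplified sketch of such a crawler is given below; the URL, CSS selectors and output file are hypothetical placeholders, since the platform's page structure is not described here:

```python
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example-mooc-platform.cn/course/forum")  # placeholder URL

rows = []
for thread in driver.find_elements(By.CSS_SELECTOR, ".forum-thread"):  # placeholder selector
    rows.append({
        "title": thread.find_element(By.CSS_SELECTOR, ".title").text,
        "created": thread.find_element(By.CSS_SELECTOR, ".created-time").text,
        "replies": thread.find_element(By.CSS_SELECTOR, ".reply-count").text,
        "views": thread.find_element(By.CSS_SELECTOR, ".view-count").text,
    })
driver.quit()

# Store the collected metadata in a data sheet (CSV file).
with open("forum_threads.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "created", "replies", "views"])
    writer.writeheader()
    writer.writerows(rows)
```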

5.2 Data preprocessing

In this study, data preprocessing included (1) screening out and eliminating irrelevant or duplicate data, (2) coding the variables and (3) data standardization. The specific steps and rules are listed and explained below.

5.2.1 Data screening and deduplication

A total of 2,344 threads containing 18,219 posts were obtained. The themes of the forum include course learning, course management, platform operation and daily communication. Of these, the content related to learning covers professional knowledge, test assignments, learning methods, exam review and learning experience. Therefore, only the MOOC learning-related threads were analyzed in this study.

Next, the threads had to be further cleaned. Anonymous threads were deleted because the variable "question–answer feedback" could not be counted for them; threads with no reply were also deleted; for threads with the same question description and reply content, only one was retained; and the feature variable "views" was removed because its values were similar to the reply numbers. Finally, 1,873 threads were retained, including 8 feature variables.
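These cleaning rules could be expressed, for instance, as the following pandas sketch; the file and column names are hypothetical stand-ins for the metadata fields listed in Section 5.1:

```python
import pandas as pd

threads = pd.read_csv("forum_threads.csv")  # hypothetical output of the crawler

threads = threads[threads["initiator_id"].notna()]     # drop anonymous threads
threads = threads[threads["reply_count"] > 0]          # drop threads with no reply
threads = threads.drop_duplicates(                     # keep one of each duplicate thread
    subset=["question_text", "reply_text"])
threads = threads.drop(columns=["view_count"])         # "views" overlaps with reply count
```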

5.2.2 Variable coding

The variable coding is described as follows:

  • Question types: In research on mathematics classrooms, Song and Wang (2006) classified classroom questions, from the metacognitive and cognitive perspectives, into suggestive questions, open questions, guiding questions, understanding questions, evocation questions and judgment questions. However, a judgment question only requires a short answer about what is right or wrong and is inappropriate for the topic characteristics of MOOC discussion. Therefore, this study deleted this category. See Table 3 for a detailed description of the classification of questions in this study.

    Table 3 Question types (adopted from Song and Wang (2006))
  • Knowledge types: (a) the categories of knowledge points (declarative knowledge = 0, productive knowledge = 1, strategic knowledge = 2); and (b) the importance of knowledge points (important point = 1, general point = 0).

  • Question organization: we categorized the organization of questions in MOOC forums. Based on this, questions can be divided into two categories: well-structured question and ill-structured question. Here, well-structured question = 1, ill-structured question = 2.

  • Question duration: it is expressed by the interval between the last post time and the release time of question. Based on the actual situation, we divided the duration of questions into 3 categories: within a day = 1, within a week = 2, more than a week = 3.

  • Question–answer feedback: the person who asked the question provided feedback on students' answers or guided other students to discuss an issue in depth after initiating a discussion. Here, no feedback = 0, one feedback = 1, multiple feedbacks = 2.

  • Teacher participation: teacher or teaching assistant participation = 1, no teacher or teaching assistant participation = 2.

  • Number of replies: the number of answers to each question in MOOC forums. It is a continuous variable.

  • Topic discussion depth: based on the cognitive process dimensions of Bloom's taxonomy, the contents of the discussion were analyzed to determine the discussion depth in the MOOC forum. The cognitive level reflected in the discussion was divided into 3 levels: shallow = 1, middle = 2, deep = 3. Table 1 presents the classification of depth dimensions and Table 4 describes the topic discussion depth in MOOC forums.

    Table 4 Topic discussion depth in MOOC forums

To ensure the validity and reliability of the coding, three experts were involved in the data coding process. All three held PhD degrees and had long been engaged in teaching and research in the field of educational technology; one was 52 years old and the other two were in their 40s.

During the variable coding process, the experts coded the posts in the 1,873 threads (except for the two variables that could be counted directly from the forum metadata, such as question duration). Each expert coded independently, and the consistency among the three experts' coding results was then inspected. The Kappa coefficient was 0.791 (values between 0.61 and 0.80 indicate high consistency), suggesting that a high level of coding consistency was reached.
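One way to inspect such consistency is to compute Cohen's kappa for each pair of coders and average the pairwise values; a minimal sketch with hypothetical depth codes is shown below (the study itself reports an overall value of 0.791):

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical depth codes (1 = shallow, 2 = middle, 3 = deep) from three coders.
codes = {
    "expert_1": [1, 2, 2, 3, 1, 2],
    "expert_2": [1, 2, 3, 3, 1, 2],
    "expert_3": [1, 2, 2, 3, 1, 1],
}

# Average the pairwise kappa values as a simple overall consistency indicator.
pairs = list(combinations(codes, 2))
kappas = [cohen_kappa_score(codes[a], codes[b]) for a, b in pairs]
mean_kappa = sum(kappas) / len(kappas)
```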

5.2.3 Data normalization

Data normalization was carried out to scale the values of a numerical feature column (assumed to be the i-th column) in the training set to a mean of 0 and a variance of 1. In our study, the variable "number of replies", which has a large range, was normalized as shown in Eq. 9.

$$x_{i} * = \frac{{x_{i} - u}}{sd\left( x \right)}$$
(9)
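A minimal sketch of Eq. 9 applied to hypothetical reply counts is:

```python
import numpy as np

def z_score(x):
    """Scale a feature to mean 0 and standard deviation 1 (Eq. 9)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)   # ddof=1: sample standard deviation

# Hypothetical reply counts for a handful of threads.
replies = [3, 12, 1, 7, 25, 4]
replies_z = z_score(replies)
```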

5.2.4 Variables selection

The discussion posts initially had 10 attributes: type of questions, definition of questions, knowledge types, importance of knowledge points, duration of questions, number of questions replied, number of views, number of votes, question–answer feedback, and whether the teacher/teaching assistant participated in the discussion. The number of views, number of votes and number of replies were merged, and the definition of questions and type of questions were also merged, yielding 8 variables. The independent variables were question type, question organization, knowledge type, importance of knowledge point, question duration, number of replies, teacher participation and question–answer feedback. The dependent variable was topic discussion depth.

5.2.5 Feature variables selection

After data preprocessing, a total of 1,482 posts were obtained, each containing nine attributes. The first eight attributes were influencing factors of online discussion depth, i.e. the input variables of the prediction model, while the last attribute (discussion depth) was the output variable of the model.

The selection of feature variables in this study is mainly based on the above references; see Table 5 for the key literature and Table 6 for the variables.

Table 5 Variables and key literature
Table 6 Description of variables

Table 7 shows the result of the correlation analysis of the eight variables. Discussion depth was positively correlated with type of question, number of replies, question–answer feedback, teacher participation, duration of question and type of knowledge point. The correlation coefficients were small, indicating that these variables were only weakly collinear with each other. However, discussion depth was negatively correlated with X2 and X4, so these two feature variables were removed and the other six variables were retained.

Table 7 The co-linear relation between variables

On this basis, the accuracy of the algorithm was tested by five-fold cross-validation. The data were divided equally into five parts; four parts were used as training data and the remaining part as test data. The average accuracy over the five runs was used to represent the accuracy of the algorithm for each of two classification models: one containing eight features and the other containing six features. The accuracy of these two models was calculated using Eq. 5 (mentioned earlier), and the results are shown in Table 8.

Table 8 The accuracy of the two classification models
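This comparison can be sketched as follows, assuming the preprocessed posts are stored in a file with the eight coded variables and the depth label; the file and column names are hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = pd.read_csv("preprocessed_posts.csv")  # hypothetical file produced by preprocessing

all_eight = ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"]
retained_six = ["X1", "X3", "X5", "X6", "X7", "X8"]   # X2 and X4 removed after correlation analysis
y = data["depth"]                                     # 1 = shallow, 2 = middle, 3 = deep

for name, features in [("eight features", all_eight), ("six features", retained_six)]:
    model = LogisticRegression(max_iter=1000)
    acc = cross_val_score(model, data[features], y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean accuracy over 5 folds = {acc:.3f}")
```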

6 Prediction model building

6.1 Optimal parameters

Logistic regression is a form of generalized linear regression, and linear classification algorithms can be formulated as the minimization of an objective function \(f(w)\).

The objective function \(f(w)\) consists of two parts: one is a regularization term that controls the complexity of the model and prevents it from overfitting the data, and the other is the error loss function. In this study, the regularization parameter λ was set to 0.5, 1 and 1.5, and for each value the parameter selection procedure was run 5 times. Accuracy was used to measure the effectiveness of the model. The optimal combination of parameters obtained by the experiment was {'penalty': 'l2', 'C': 1.5}.

Based on the logistic regression algorithm with the regularization parameter set to 1.5, six feature variables were selected. The feature importance ordering, from high to low, was: X1, X5, X6, X8, X7, X3 (see Fig. 1).

Fig. 1 Feature importance ordering based on logistic regression
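A sketch of how such an ordering can be obtained with the reported parameters is shown below; the file and column names are hypothetical, and the mean absolute coefficient across the three depth classes is used here as a simple proxy for feature importance:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("preprocessed_posts.csv")            # hypothetical file name
features = ["X1", "X3", "X5", "X6", "X7", "X8"]
X, y = data[features], data["depth"]

model = LogisticRegression(penalty="l2", C=1.5, max_iter=1000)  # optimal parameters from Sect. 6.1
model.fit(X, y)

# Rank features by the mean absolute coefficient across the three depth classes.
importance = np.abs(model.coef_).mean(axis=0)
ranking = sorted(zip(features, importance), key=lambda t: t[1], reverse=True)
```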

6.2 Regression models

Table 9 presents the regression equations constructed with "depth of question discussion" as the dependent variable. The analysis consists of 6 models.

Table 9 Regression models for dependent variables Y

In Model 1, the regression coefficient of "question types" was 0.435, p < 0.001 (a strong effect). After adding the variable "question duration" in Model 2, the regression coefficient of "question types" decreased to 0.377 but remained significant (p < 0.001). This result shows that the depth of question discussion is related to both the type and the duration of the question.

In Model 3, the regression coefficients of "question types" and "question duration" continued to decrease, while the regression coefficient of "number of replies" was 0.311 and the goodness of fit of the model increased to 38.1%. This indicates that the number of replies to a question is a significant influencing factor for differences in the depth of question discussion. In Models 4, 5 and 6, the three independent variables "question–answer feedback," "teacher participation" and "knowledge types" were added respectively. The regression coefficients of these three factors on the depth of question discussion were 0.161, 0.151 and 0.072, respectively.

Although the regression coefficients of these three variables on "topic discussion depth" are lower than those of the first three variables, they still have a significant influence (p < 0.001), and the goodness of fit of the model increased. Overall, the six factors explained 43.8% of the variance in the depth of discussion.

6.3 Prediction formula and probability

Table 10 decomposes the discussion depth variable Y into different levels (i.e. deep, middle and shallow) for regression.

Table 10 The results of the regression model of 6 feature variables

From Table 10, we can get the following three formulas:

$$Y_{1} = -0.78X_{1} - 0.35X_{3} - 0.77X_{5} - 0.71X_{6} - 0.4X_{7} - 0.41X_{8}$$
(10)
$$Y_{2} = 0.21X_{1} + 0.29X_{3} + 0.4X_{5} + 0.27X_{6} + 0.12X_{7} + 0.06X_{8}$$
(11)
$$Y_{3} = 0.87X_{1} - 0.08X_{3} + 0.57X_{5} + 0.84X_{6} + 0.53X_{7} + 0.4X_{8}$$
(12)

Based on the logistic regression algorithm, the probability that a discussion topic reaches each of the three levels of discussion depth is predicted, and the level with the highest probability is selected as the discussion depth. To this end, the three Eqs. (10), (11) and (12) are transformed by the softmax function, which maps the values of Y1, Y2 and Y3 to the interval (0, 1) and ensures that the mapped values sum to 1. The transformation results are given in Eqs. (13), (14) and (15).

$$P_{Y1} = \frac{{e^{Y1} }}{{e^{Y1} + e^{Y2} + e^{Y3} }}$$
(13)
$$P_{Y2} = \frac{{e^{Y2} }}{{e^{Y1} + e^{Y2} + e^{Y3} }}$$
(14)
$$P_{Y3} = \frac{{e^{Y3} }}{{e^{Y1} + e^{Y2} + e^{Y3} }}$$
(15)

It is thus possible to predict the probability that a discussion topic reaches each of the three levels of discussion depth and thereby to evaluate the role of the topic in learning or teaching. This assessment of discussion topics is also a valuable reference for the design of teaching and learning on the MOOC platform.
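As a worked example, a sketch of the full prediction step for a single topic, using the coefficients of Eqs. (10)–(12) and the softmax transformation of Eqs. (13)–(15), is given below; the feature values are hypothetical:

```python
import numpy as np

# Coefficients from Eqs. (10)-(12), ordered as X1, X3, X5, X6, X7, X8.
W = np.array([
    [-0.78, -0.35, -0.77, -0.71, -0.40, -0.41],   # Y1: shallow
    [ 0.21,  0.29,  0.40,  0.27,  0.12,  0.06],   # Y2: middle
    [ 0.87, -0.08,  0.57,  0.84,  0.53,  0.40],   # Y3: deep
])

def predict_depth(x):
    """Return softmax probabilities for shallow/middle/deep and the predicted level."""
    scores = W @ x                                  # Y1, Y2, Y3
    probs = np.exp(scores) / np.exp(scores).sum()   # Eqs. (13)-(15)
    return probs, ["shallow", "middle", "deep"][int(np.argmax(probs))]

# Hypothetical coded (and standardized, where applicable) feature values for one topic.
x = np.array([3.0, 1.0, 2.0, 0.8, 2.0, 1.0])
probs, level = predict_depth(x)
```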

6.4 Verification of prediction models

Model verification in machine learning is mainly measured by precision, recall and the F value, calculated as shown in Eqs. (5)–(8) above; the meaning of the terms is shown in Table 2. The test results are shown in Table 11. The accuracy of the feature combination (X1, X5) was 67%; the accuracy of the feature combination (X1, X5, X6) was 65%; and the accuracy of the feature combination (X1, X5, X6, X7) was 68%. The results of the models were compared; see Table 9.

Table 11 Verifying the results of variables in the discussion level (Y1, Y2, Y3)

The types of questions were then divided into arousal, comprehension, oriented, open-ended, and metacognitive, and the depth of discussion was divided into shallow, middle and deep for measurement. The results are shown in Table 12. Among the levels, the middle level of discussion was the most prevalent, and comprehension and oriented questions were the two question types that worked better. The chi-square test gave χ² = 910.881, p < 0.001.

Table 12 Verifying the question types and the discussion level (Y1, Y2, Y3)
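A minimal sketch of such a test on a contingency table of question types by discussion levels is given below; the counts are hypothetical placeholders, not the study's data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: arousal, comprehension, oriented, open-ended, metacognitive question types.
# Columns: shallow, middle, deep discussion levels. Counts are hypothetical.
table = np.array([
    [120,  40,  10],
    [ 90, 150,  60],
    [ 70, 130,  80],
    [ 30,  60,  70],
    [ 10,  30,  50],
])

chi2, p, dof, expected = chi2_contingency(table)
```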

7 Discussion and conclusion

In this study, we attempted to construct an automatic prediction model of the depth of discussion of learning topics. In addition, we aimed to explore the predictor variables of the depth of discussion of learning topics. Our findings showed that question type, question duration, and number of replies are the three most important variables in the prediction model of deep MOOC discussion. Furthermore, we found that question–answer feedback and knowledge types also contribute to the prediction model. We discuss these aspects in more detail in the following subsections.

7.1 Predictive models and the depth of online discussion

Big data analysis makes it possible to automatically evaluate the depth of online discussion. With the help of online course learning data, this study establishes a mathematical model and predicts the depth of discussion questions. In MOOC forums, question-related variables are the key to building prediction models; in this study, the most important variable for in-depth discussion is the question type. The prediction model in this study can not only identify effective questions in the forum for MOOC builders, but also provide effective teaching help for MOOC users. Similarly, Almatrafi et al. (2018) used machine learning algorithms to construct a prediction model for identifying urgent discussion posts that teachers in MOOC discussion forums need to notice and respond to. Qin (2017) predicted the categories of discussion posts by training an SVM classification model, thus enabling sentiment analysis of the course.

7.2 The question types and the depth of online discussion

The data indicate that question types have strong impacts on the shallow, medium, and deep levels of discussion. According to Bloom's taxonomy, arousal and comprehension questions belong to the types with low cognitive level. Accordingly, most online discussion posts were at the shallow level, while metacognitive question types were better at promoting depth of discussion. This conclusion is consistent with the view of Nancy (2015), who emphasized that Bloom's taxonomy highlights the necessity of learning objectives, and that problems requiring high-level cognitive processes lead to deep learning and efficient problem-solving. Table 12 shows the specific ratios between question types and levels of discussion in MOOC discussions.

According to Table 12, in MOOC discussion, teachers are encouraged to assign questions that lead students to discuss their solutions in depth or to reflect on the problem-solving process, e.g. open questions with controversial and inconclusive exploratory content. At the same time, teachers can set up reflective discussion topics. For example, discussion of metacognitive questions can give online students opportunities for self-reflection and self-assessment. On the basis of their analytical thinking, and through debate and negotiation, students can come to see questions as useful, and their engagement in discussion can promote the development of their critical thinking as well as achieve deep online discussion.

7.3 The frequency and duration of discussion and the depth of online discussion

The data showed that the discussion depth score of posts with fewer than seven replies was much lower than that of posts with more than seven replies, and posts whose questions lasted more than seven days scored higher for discussion depth than posts whose questions lasted less than a day. One reason is that the discussion area of this course had not formed an environment in which students debate and negotiate, and students lacked the habit of exchanging and challenging each other's ideas. The lack of questioning and integration in peer interaction led to a small number of topic responses, so the depth of online discussion remained at the shallow level. A post with a short discussion duration could not reach the level of in-depth discussion, possibly because students did not have enough time to digest and understand what they had learned and had not yet assimilated it into their own knowledge system, so they lacked deep thinking about the discussion topics.

In view of this situation, teachers can broaden the discussion forum by encouraging more meaningful posts and discussion, such as debating or negotiating around these posts among students. Teachers can take the number of students' replies in the discussion forum as an important assessment indicator, so that students attach importance to the discussion forum. On the other hand, teachers can build learning communities where students communicate with each other, share ideas, brainstorm on discussion issues, and achieve common goals. This not only ensures students' understanding and mastery of knowledge, but also promotes the depth of their discussion.

7.4 The teachers’ guidance and the depth of online discussion

The teacher's replies in the discussion forum of this course mainly concerned the key and difficult points of the course content, evaluation feedback, test evaluations and guidance. The leading role of teachers in a MOOC affects the continuity and stability of students' participation in learning to some extent (Zhou, 2016). In some posts, students' initial discussion could not reach the expected effect: the focus of the discussion was not clear and the topic was scattered, with the result that most students could only gain a superficial understanding and further, deeper discussion was difficult. At this point, the teacher or teaching assistant can explain the content and key points of a response to a given question, so that students can have an in-depth discussion of the question.

Therefore, the leading role of teachers in forum discussions cannot be ignored. Teachers' positive feedback and guidance play an important role in the in-depth discussion of issues in the forum and in the process of cultivating students' reflective ability and deep cognition. As Wang (2013) pointed out, the teacher guides the students to focus the discussion, uses the threading method to interweave the students' viewpoints, and connects those viewpoints to the learning results of the course.

7.5 The questioner’s support and the depth of online discussion

The discussion depth score of posts in which the questioner responded more than three times was much higher than that of posts with no feedback or fewer than three instances of feedback. The questioner's timely feedback, effective guidance, and recognition and praise for other students are important factors in stimulating students' active learning and deep thinking, and important support for promoting students' in-depth discussion.

In this case, the course needs to pay attention to process prediction. Process prediction is an activity in which students' learning information is interpreted instantly and dynamically during teaching activities so as to reveal, judge and generate teaching value (Liu et al., 2020); it is constructed during implementation (Goodman et al., 1996). The questioner can comment on the content of a reply in a timely way, for example by marking a more insightful reply as "top" or "agree." When there are different opinions in the discussion, the questioner may clarify the discussion or hold his or her own position and engage in a debate. Students are also encouraged to collect and sort out the respondents' different viewpoints, deepen the discussion, inspire more students to think about problems from different perspectives, and jointly construct knowledge. This information can be recorded in the student's electronic portfolio, and the student can be awarded points that count toward the total score. If the questioner only posts without giving any feedback, the discussion post will be deemed invalid and will not receive a grade for the discussion section of the course. Therefore, teachers are advised to give immediate feedback on posts in the learning discussion and to adjust their teaching according to the process prediction.

7.6 The types of knowledge and the depth of online discussion

The types of knowledge have a certain influence on the depth of forum discussion. The score of strategic knowledge discussion posts is higher than that of declarative knowledge discussion posts, which indicates that strategic knowledge can promote in-depth discussion among online learners. This may be because strategic knowledge guides students in mastering rules and helps them complete their learning. Therefore, in online discussion, teachers should provide more strategic knowledge, cultivate students' reflective ability and improve students' creativity.

7.7 Value of this research

In this study, content analysis was used to quantify the data, statistical methods were used to test the variables related to MOOC online discussion, and a machine learning classification algorithm was used to train on sample data and build a discussion depth evaluation model through which unknown discussion data can be classified and predicted. Analyzing MOOC discussion data in an objective, accurate and efficient way and studying online learning through data models is one of the promising research directions in the online learning context. For example, Almatrafi et al. (2018) built a predictive model for urgent discussions and Qin (2017) developed a model for predicting post categories. From the perspective of deep learning and cognitive development, this study conducted an in-depth analysis of MOOC course discussion forum data and established a prediction model, providing a reference for MOOC course developers and researchers who focus on online learning.

In online learning, in-depth discussion is one of the key points of course teaching. Educators need to focus on the type of questions, the duration of questions, and the feedback to questions in topic discussion. These are not only important characteristic indicators for establishing mathematical models (similar indicators are also found in the research by Spatariu et al. (2007)), but also important factors affecting the depth of online discussion; Gerber et al. (2005) reached a similar conclusion about online discussion. The present research can help identify which problems and knowledge need attention and guidance during the online teaching process. If a teacher is teaching online, the model can be used to infer whether teaching questions lead to deep learning. Therefore, teachers need to pay more attention to the types of questions, guide students to think deeply about problems, and improve students' cognitive development. This is especially important under the influence of the epidemic, when most students learn online.

7.8 Limitations and future research

The classification model trained on the feature set of this study performed well, but did not reach a superior level (> 95% accuracy). Possible reasons lie in the selection of prediction features used to measure the depth of online discussion and in the processing of the online discussion data. For example, non-linear features need to be converted, manual data annotation is susceptible to the knowledge background of the annotators, and the model does not perform well when prediction features are missing.

Logistic regression is one of several machine learning classification methods; decision trees, support vector machines and random forests are other common methods. In future studies, appropriate machine learning algorithms will be selected according to their characteristics to improve classification performance, serving learning and teaching in MOOCs.

In addition, this research only collected data from one course on a MOOC platform for Chinese universities. The research scope is narrow and the number of research samples is limited: the discussion posts available for a single course did not constitute a data set large enough to train a high-performance model. The next step will therefore be to integrate discussion posts from more MOOCs, enlarge the training data set, and train models with higher performance. In subsequent studies, models can also be constructed using gradient boosting trees, neural networks, Naive Bayes and other methods to further compare their performance, and relevant prediction standards can be established to provide targeted services for learning and teaching in MOOCs.

With the emergence and development of big data and analytics techniques, more results will be predicted from such data in future studies. Machine learning produces models that predict learning outcomes based on data generated in MOOCs. In this study, the learning results of the MOOC were closely related to the types of learning questions: question type had the largest coefficient (0.87) in the deep discussion equation (Eq. 12). Therefore, in MOOC discussions, we first need to pay attention to the types of questions. Such prediction models will assist teaching and learning more efficiently in the future.