Identifying Factors for Master Thesis Completion and Non-completion Through Learning Analytics and Machine Learning

Introduction
The master thesis is the last formal step of a master's degree at most universities around the world. However, not all students finish their master thesis. A considerable number of students struggle with the thesis process, resulting in delays, disruptions, and non-completion of their degrees [1][2][3]. Such outcomes are devastating for individual students and for academic institutions, which waste time, money and energy, and for societies, which are not strengthened with highly skilled workers [2,4,5]. Therefore, ensuring that students enrolled in graduate programs obtain their degrees in a timely fashion is in the best interest of students, higher education institutions and societies [4,6].
Previous research related to thesis projects has identified some variables that influence the performance of students undertaking thesis projects, in particular variables that concern the relation between the student candidate and the supervisor [2,5,14]. The specific student variables that have been indicated as influencing thesis completion are students' attitudes and motivation [10], the students' average entry grade [12], and the students' communication and language skills [13]. Among the supervisor variables, it has been shown that the supervisor's experience, research output and workload constitute factors of thesis success [13,15]. However, the review of the literature leads to the conclusion that there are few studies explicitly focusing on master thesis projects. Studies on completion of thesis projects mostly concern the doctoral thesis [16,17], while studies on master thesis completion tend to focus on the whole program, not the thesis specifically [18][19][20]. Furthermore, most studies have used a qualitative approach to investigate factors for thesis completion; single factors have been looked at in isolation, with a primary focus on student variables and on completion factors (and not on non-completion and supervisor variables) [21][22][23]. Finally, there are few contemporary studies that look at factors for success and failure related to thesis work.
Today, thesis management systems, such as SciPro from Stockholm University [24] and Thesis Writer (TW) from Zurich University of Applied Sciences [25], generate a lot of data concerning many aspects of the thesis process. This paves the way for using learning analytics techniques in order to gain data-driven insights about thesis management and the factors that affect thesis retention [26]. Learning analytics has been used successfully to map early indicators of successful course completion, inform course design, provide insights and feedback to teachers and students, as well as improve educational outcomes [27].
This study sets out to better understand the factors that influence completion and, in particular, non-completion of master thesis projects. More specifically, we ask: which student and supervisor variables influence completion and non-completion of the master thesis? Can we predict completion and non-completion of the master thesis using such variables in order to optimize the matching of supervisors and students?
To answer these research questions, we extracted data about supervisors and students from two thesis management systems, Daisy and SciPro from the Department of Computer and Systems Sciences, Stockholm University, which record large amounts of data related to the thesis process. The sample used was 755 master thesis projects supervised by 109 teachers. By applying traditional statistical methods (descriptive statistics, correlation tests and independent sample t-tests), as well as machine learning algorithms, we identify five central factors that can accurately predict master thesis completion and non-completion. Besides the identified factors and predictors that explain master thesis completion and non-completion, this study contributes to demonstrating how educational data and learning analytics can produce actionable data-driven insights. In this case, insights that can be utilised to, on the one hand, inform and optimise how supervisors and students are matched, and on the other hand, stimulate targeted training and capacity building of supervisors.

Identified Factors in the Literature Explaining Thesis Completion and Non-completion
Our literature review has led to the identification of two groups of factors that influence thesis outcomes: those related to the student candidate and those related to the supervisor. Below we give an account of what is known about these two groups of factors. Rennie and Brewer [10], using a grounded theory approach to investigate the problem of thesis delay, proposed the term 'thesis-blocking'. They propose that thesis-blocking factors are more numerous than factors leading to timely completion. Successful thesis completion is dependent on the candidate's conformity and acceptance of the process. Failure of the supervisor to handle a candidate's negative feelings is the reason why many candidates become stuck in the middle of the path [10]. House and Johnson's findings point to the applicants' average entry grade as a decisive predictive factor of successful, timely completion [12], a finding that was corroborated by Jiranek [13] and Wright and Cochrane [28].
On the other hand, studies have shown that entry grade is not a significant predictor of completion [28,29]. In a study by Pascarella and Terenzini [30], it was shown that the background characteristics, including entry grades, only explain a small part of retention, while academic and social integration explain more.
Other student factors affecting completion or non-completion include communication and language proficiency skills [13], as well as self-reliance and independence [31]. However, the right balance and proactive planning, along with institutional support, could mitigate the impact and assist the candidates [2,5,13,15,31,32]. Contrary to common belief, part-time older candidates appear to be better than their counterparts in their approach to research, in handling other duties and in being independent [28].
It has also been shown that supervisors' behaviours are crucial in every stage of the thesis work: in supporting the thesis writing process, rectifying errors, suggesting directions and being responsible for arranging the defence [33]. Rennie and Brewer compare the supervisor's role in these cases to the writer's block phenomenon [10]. They suggested that both share essential features, the main problem being the writer's internalisation of the critical feedback by the supervisor and poor management of duties and time constraints.
A healthy relationship between student and supervisor is helpful for the success of the thesis. The thesis is an embedded social exercise more than most other educational projects; therefore, collaborating with the supervisor, regular productive meetings and the ability to reach a shared understanding are central to the success of the project [2,5,14,34,35]. A relationship where the supervisor exerts moderate control of the process and more significant affiliation was found to influence the successful outcome in terms of time to completion and completion rates [36]. Supervisor experience and research production are factors that might positively affect thesis outcomes [13].
In general, the supervisor's support through all the stages of the thesis process is an indispensable factor [5,13,32,34,37]. Conversely, supervisors who are overwhelmed by research work, teaching or multiple students have less time for their students, which has negative effects on the thesis work [2,15]. Furthermore, students report that it is central that supervisors provide constructive, on-time feedback, as well as encouragement [38].

Sample and Context
The sample for this study consisted of master students' thesis projects (n = 755) during the period between 2010 and 2017 at the Department of Computer and Systems Sciences, Stockholm University, Sweden. Since it takes approximately 350 days for students to complete a thesis project (from course registration to grade registration), data from 2018 were excluded, as they contained many projects likely to be completed after the data extraction. The dropout rate for the thesis project at the department is approximately 43% for the period studied. We have included all master thesis projects that adhere to the present curriculum for thesis projects.

Data Collection
A challenge in data collection for learning analytics is to avoid amplifying errors from different standards in data sources, especially if some sources are external and out of control. In this study, to minimise this risk for all data sources, we used data that are under the control of the university.
Data collection was performed in several iterative steps. Using SQL (structured query language) queries, we extracted data from two different data systems used by the department to record data about the thesis projects. From these systems, we collected thesis project data concerning both students and supervisors. Informed by factors identified by previous research [12,13], and taking into account additional variables that were available in the systems that record thesis data, we focused in general on three groups of factors that influence the academic thesis process, namely: (1) student's previous performance in the master program; (2) supervisor's thesis project performance and experience; and (3) supervisor's research output.
More specifically, we extracted the following variables:
• Thesis project: start and completion date. From this, the number of days to completion was calculated.
• The students (n = 755): the grade of the thesis, the average grade in the study before the master thesis, and the number of course credits received within the educational program.
• The supervisors (n = 105): number of scientific publications, the average number of scientific publications per year, number of complete/incomplete thesis projects, the average grade of thesis projects, number of started thesis projects, and average days of supervisors to complete thesis projects were calculated from the projects.
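The supervisor variables listed above are aggregates over individual project records. As a rough illustration of how such aggregates could be derived, the following Python sketch groups projects by supervisor and computes the number of started and incomplete projects, the ratio of incomplete projects, and the average days to completion. The record layout, the field names and the toy records are assumptions made for this example; they are not taken from Daisy or SciPro.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import Optional


@dataclass
class ThesisProject:
    supervisor_id: str         # fictive ID (hypothetical field name)
    start: date
    completed: Optional[date]  # None means the project was not completed


def days_to_completion(p: ThesisProject) -> Optional[int]:
    """Days from start to completion, or None for an incomplete project."""
    return (p.completed - p.start).days if p.completed else None


def supervisor_features(projects):
    """Aggregate per-supervisor variables from individual project records."""
    by_supervisor = defaultdict(list)
    for p in projects:
        by_supervisor[p.supervisor_id].append(p)

    features = {}
    for sid, items in by_supervisor.items():
        durations = [d for d in map(days_to_completion, items) if d is not None]
        incomplete = sum(1 for p in items if p.completed is None)
        features[sid] = {
            "started": len(items),
            "incomplete": incomplete,
            "incomplete_ratio": incomplete / len(items),
            "avg_days": mean(durations) if durations else None,
        }
    return features


# Toy records, invented purely for illustration.
projects = [
    ThesisProject("S1", date(2015, 1, 15), date(2015, 12, 31)),
    ThesisProject("S1", date(2016, 1, 15), None),
    ThesisProject("S2", date(2016, 1, 15), date(2016, 6, 13)),
]
features = supervisor_features(projects)
```

In this sketch the average duration is computed over completed projects only, which is one possible reading of "average days of supervisors to complete thesis projects".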
All data was anonymised by converting personal identifiers to fictive IDs. The researchers who did the analysis did not know the identity of the subjects. The data was subsequently prepared for statistical and predictive analytics by removal of extreme and null values and through the computation of relevant variables.
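A minimal sketch of this preparation step is given below, assuming rows are represented as dictionaries. It replaces a personal identifier with a stable fictive ID and drops rows that contain null values. The helper names, the ID format and the example rows are hypothetical, and the removal of extreme values is left out because the criterion used for it is not stated here.

```python
def make_pseudonymiser(prefix):
    """Return a function mapping each real identifier to a stable fictive ID."""
    mapping = {}

    def pseudonym(real_id):
        # The same real identifier always receives the same fictive ID.
        if real_id not in mapping:
            mapping[real_id] = f"{prefix}{len(mapping) + 1:04d}"
        return mapping[real_id]

    return pseudonym


def prepare(rows, id_field, pseudonym):
    """Replace the identifier with a fictive ID and drop rows with null values."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue
        cleaned.append({**row, id_field: pseudonym(row[id_field])})
    return cleaned


# Invented example rows; the addresses are placeholders.
rows = [
    {"student": "alice@example.org", "avg_grade": 3.8},
    {"student": "bob@example.org", "avg_grade": None},
    {"student": "alice@example.org", "avg_grade": 4.1},
]
cleaned = prepare(rows, "student", make_pseudonymiser("ST"))
```

Because the mapping links real and fictive identifiers, it would have to be kept away from the analysts for the anonymisation to hold.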
Ethical approval for this study was obtained through the Regional Board of Ethical Vetting in Stockholm. Consent for participating in this research was also obtained from the selected supervisors in the sample. Six supervisors and their associated thesis projects were excluded because no consent to use their data was received.

Data Analysis
The analysis was performed using SPSS and R. A Spearman correlation test was conducted to investigate the correlation between incomplete thesis projects (dropouts) and student and supervisor variables. Multiple independent sample t-tests were performed in order to explore differences between completers and non-completers with regard to student and supervisor variables. The Shapiro-Wilk test of normality was employed and confirmed that the assumptions for the t-tests were satisfied.
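To make the correlation analysis concrete, the sketch below computes Spearman's rank correlation in plain Python: both variables are converted to ranks (tied values share their average rank) and the Pearson correlation of the ranks is taken. This is an illustration only, not the SPSS or R procedure used in the study; the function names and the toy data are assumptions, and no p-value is computed.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data, invented for illustration: 1 = completed, 0 = not completed.
completed = [1, 1, 1, 0, 0, 0]
incomplete_ratio = [0.1, 0.2, 0.3, 0.5, 0.6, 0.7]
rho = spearman(completed, incomplete_ratio)
```

Handling ties matters here because the completion variable is binary, so every observation is tied with many others.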
For the predictive analytics, seven supervised machine learning classifiers were applied: Naive Bayes, Logistic Regression, Gradient Boosted Trees, Neural Network, Deep Learning, Decision Tree, and Random Forest in order to predict completers and non-completers of thesis projects. These classifiers were chosen because they are frequently used for predicting dropout, and each has demonstrated good and comparable performance in predicting at-risk students and dropout [39,40]. The data set was split into a training and testing set. The training set consisted of 70% of the total data set, and the testing set the remaining 30%. After the implementation of the predictive models, features were ranked using the information gain ratio. To prevent overfitting and increase robustness, 10-fold cross-validation was performed, where performances were measured from multiple iterations of cross-validation and averaged over iterations. To measure the prediction performance of the different models, the area under the receiver operating characteristic curve (AUC) was obtained, along with measures for precision and recall.
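The evaluation protocol can be sketched in terms of row indices alone. The example below performs a shuffled 70/30 split of 755 rows and then generates 10 cross-validation folds. Applying the folds to the training portion is an assumption of this sketch, since the text does not specify how the split and the cross-validation were combined; the function names and the seed are likewise illustrative.

```python
import random


def train_test_split(indices, test_fraction, seed):
    """Shuffle the indices and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(indices, k):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# 755 matches the number of projects in the sample; the seed is arbitrary.
train_idx, test_idx = train_test_split(range(755), 0.30, seed=42)
fold_sizes = [len(validation) for _, validation in k_folds(train_idx, 10)]
```

A performance measure would then be computed on each validation fold and averaged over the ten folds.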

Results
The full descriptive statistics are presented in Table 1. Among the 755 thesis projects studied, 57% were completed, and 43% remained non-completed (see Fig. 1).
After performing the descriptive analysis presented in Table 1, a Spearman's correlation test was performed in order to study the correlation between completion of thesis projects and all student and supervisor variables presented in the above table. This analysis revealed that completion is significantly correlated with the ratio of incomplete thesis projects of supervisors (r = −0.36, p < 0.01), students' average grade in their study program at the university (r = 0.28, p < 0.01), supervisors' total number of incomplete thesis projects (r = −0.17, p < 0.01), the average time it takes for supervisors to complete thesis projects (r = −0.14, p < 0.01), the ratio of supervisor thesis ideas (r = 0.10, p < 0.05), and supervisors' average thesis grade (r = 0.09, p < 0.04). As can be noted, the ratio and total number of unfinished thesis projects by supervisors presented the strongest correlations with thesis dropout, together with students' average grade during the educational program. All grade-related variables are on a 6-item scale ranging from F = 0 to A = 5.
Multiple independent t-tests were also performed in order to explore differences between completers and dropouts with regard to many student and supervisor variables. See Table 3 for a full presentation of the t-test results. Based on these tests, the following can be concluded: Significant differences were, however, not revealed concerning the total number of scientific publications published by supervisors, the total number of thesis projects supervised by the supervisors, or the total course credits received by students within the educational program prior to the master thesis (Table 2).

Predicting Completion and Non-completion
Predictive analytics was then performed using several machine learning models (Naive Bayes, Logistic Regression, Deep Learning, Decision Tree, Random Forest and Gradient Boosted Trees) in order to predict the completion/non-completion variable using the features described in Table 1. The performance across the models showed AUC values between 0.74 and 0.83 (see Table 3). The Gradient Boosted Trees model performed best with regard to accuracy and AUC, with almost 72% accuracy in predicting completers and non-completers. The actual non-completers could be predicted with 66% precision and 67% class recall, while the completers could be predicted with 76% precision and 75% class recall (see Table 4).
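For readers less familiar with the reported measures, the sketch below shows how class precision, class recall and AUC can be computed from true labels, predicted labels and predicted scores. AUC is computed here as the share of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The labels and scores are invented toy values and do not reproduce the figures in Tables 3 and 4.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def auc(y_true, scores, positive):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = completer, 0 = non-completer; scores are made up.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

completer_pr = precision_recall(y_true, y_pred, positive=1)
non_completer_pr = precision_recall(y_true, y_pred, positive=0)
area = auc(y_true, scores, positive=1)
```

Precision and recall are reported per class because the cost of missing a likely non-completer differs from the cost of misclassifying a completer.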
As can be seen from Table 5, the features with most weight were the ratio of unfinished thesis projects of supervisors, students' average grade during university studies, supervisors' total number of incomplete projects and the average time it takes for supervisors to complete a thesis project.
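The feature weights were obtained with the information gain ratio, which divides the reduction in label entropy achieved by a feature by the entropy of the feature itself. The sketch below implements this for a categorical feature. It assumes that continuous variables such as grades or ratios have already been discretised, which is an assumption of the example rather than a description of the tool used in the study; the feature values are invented.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by split info."""
    total = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    # Expected label entropy after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return (entropy(labels) - conditional) / split_info


# Toy data: 1 = completed, 0 = not completed; feature values are invented.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
informative = ["low", "low", "low", "low", "high", "high", "high", "high"]
uninformative = ["x", "y", "x", "y", "x", "y", "x", "y"]

ranking = sorted(
    {"informative": gain_ratio(informative, labels),
     "uninformative": gain_ratio(uninformative, labels)}.items(),
    key=lambda item: item[1],
    reverse=True,
)
```

Dividing by the split information penalises features with many distinct values, which would otherwise score a high information gain simply by fragmenting the data.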

Discussion
Not finishing a master thesis is a devastating personal experience for students that costs precious time, money and energy. Non-completion also results in a vast waste of faculty time and institutional resources, and a societal loss of highly skilled workers [2,4,5]. This study set out to address this problem by using large amounts of thesis-related data generated in thesis management systems in order to create data-driven insights about the factors that influence completion and non-completion of master thesis projects. Such a learning analytics approach led us to identify factors that have not been reported on in the research literature.
The analysis of the data resulted in the identification of five central factors that influence students' completion and non-completion of master thesis projects. The strongest factor and predictor for non-completion, not reported on before, proved to be supervisors' history of incomplete thesis projects. This factor correlated more strongly with incomplete thesis projects than students' academic performance before starting the thesis (which was the second strongest predictor) and was the factor/feature that had the most information gain (weight) in the predictive models produced by the employed machine learning algorithms. Three additional factors/predictors were identified related to the supervisor, namely: (1) the average days it takes for supervisors to complete thesis projects; (2) the average grade of thesis works supervised; and (3) the average number of scientific publications produced by the supervisor per year.
Thus, the supervisor's historical thesis supervision performance and their performance as researchers, together with students' academic performance prior to the thesis, to a high extent determine success and failure of master thesis projects. Previous research, mainly through qualitative studies, has demonstrated that supervisors indeed play a significant role in the thesis process, in particular by pointing out how successful teachers supervise. This quantitative study identifies actual predictors related to the supervisor and demonstrates the significant effect of supervisors' historical performance on master thesis completion and non-completion, which constitutes a central contribution of this study. However, the finding that students' academic performance prior to the thesis is a significant predictor has been reported on before and corroborates previous research [12,13,28].
Besides the identified factors and predictors that explain master thesis completion and non-completion, this study contributes to demonstrating how educational data and learning analytics can produce actionable data-driven insights. In this case, gained insights can be utilized to, on the one hand, inform and optimize how supervisors and students are matched, and on the other hand, stimulate targeted training and capacity building of supervisors.
Future research work can build upon this study and bridge its limitations by adding more contextual factors to the analysis, such as students' internal conditions and dispositions [41][42][43]. By dispositions, we mean behavioral and cognitive factors such as motivation (to write a master thesis, for instance), engagement, self-regulation skills, strategies and attitudes [43,44]. Such an approach would most likely increase the probability of finding additional factors that influence the master thesis process and increase accuracy, replicability and transferability of prediction models [27,45,46].