Effect of flipped classroom and automatic source code evaluation in a CS1 programming course according to the Kirkpatrick evaluation model

This paper evaluates learning efficiency by implementing the flipped classroom and automatic source code evaluation, assessed with the Kirkpatrick evaluation model, among students of a CS1 programming course. The experiment was conducted with 82 students from two CS1 courses: an experimental group (EG = 56) and a control group (CG = 26). Each student in each group completed 15 programming tasks. The level of knowledge acquired in the two groups is measured using the Kirkpatrick model, taking as sources a pre-test of prior knowledge, the grade assessment, the time spent on the activities, and a post-test of the learning achieved. When comparing submission times between the experimental and control groups, the means are similar for the EG and CG; in this case, time is not a differentiating factor between the groups. In the grades, however, the means differ: EG students scored better than CG students. The evaluation with the Kirkpatrick model shows that the implemented strategy does not improve the time spent on the activities, but it does improve the grades in the CS1 course.


Introduction
In recent years, higher education institutions have started to implement strategies and tools to evaluate the effectiveness of learning processes (Lacave et al., 2018). The flipped classroom (FC) is a pedagogical strategy that allows students to acquire their own knowledge through collaborative work with the teacher and other classmates (Ramírez & Burbano, 2014). In (Mok, 2014), 'flipping the classroom' was found to support active learning, collaboration, and problem solving. That study also reports that 96% of teachers who use FC recommend its implementation because both student participation and grades improve.
Other contributions are related to course analysis, where teacher practices are evaluated. Learning systems or intelligent tutors have also been built to help break down complex tasks into simpler activities. In this way, students can generate solutions based on sequences of steps that contribute to their learning process (Algayres & Triantafyllou, 2019; Avdeeva et al., 2015). The aforementioned studies identify various activities that improve the learning experience and support the teacher in meeting the set objectives for the course (Alario-Hoyos et al., 2017; Djenic & Mitic, 2017).
Research into programming courses has focused on reducing dropout rates and improving academic performance and communication skills (Bansal & Singh, 2015; Ibanez-Cubillas et al., 2018). To this end, studies have gathered data on student behavior, learning management systems or tools (LMS or Moodle), time spent online, completion of activities, attendance, and level of knowledge acquired (Ahmad Uzir et al., 2020; Sun et al., 2016). All these data collection tasks can be carried out with automatic evaluation tools, which can provide analysis of the activities submitted by students (Pereira et al., 2021; Hidalgo & Bucheli, 2019). However, there are few contributions related to the effectiveness of programming courses, and this is important in order to identify the pedagogical strategies which best support the teaching-learning process (Dávila, 2016).
Pedagogical strategies use techniques and methods that contribute to academic learning and evaluation. They integrate statistical elements and metrics to measure the teaching-learning process. They identify the effectiveness of course activities and the strategies used by students (Perez et al., 2017). However, higher education institutions must evaluate the attitudes, abilities, and skills of all those involved in the training process, in order to improve the quality of education (Díaz Barriga Arceo et al., 2019).
Various models can be used to measure the efficiency of a course, one of which is the Kirkpatrick model (Kaufman & Keller, 1994). This model consists of four levels of evaluation: Reaction, Learning, Behavior, and Results, and it has been widely applied across different educational settings (Ruiz & Snoeck, 2018). It is considered useful by some academics for assessing student learning, and it provides techniques to evaluate any training program (Kusumaningrum et al., 2022; Zokaei & Shakerian, 2022). The model can be used to measure course effectiveness or to evaluate whether a program meets the needs and requirements of the institution (Alsalamah & Callinan, 2021; Ho et al., 2016). This paper includes this model because it is able to evaluate the contributions made by the chosen FC pedagogical strategy and whether the strategy and the automatic evaluation of the source code contribute to the learning outcomes proposed in the programming course.
There are few research studies that evaluate the effectiveness of programming courses. This task is very important because it allows problems in the courses to be identified and strategies to be implemented to support the teaching-learning process. For this reason, we propose this paper, which uses the Kirkpatrick model to evaluate the effectiveness of programming activities, the students' perception of the teacher, the pedagogical tools, and the learning platforms implemented during the course. The proposed learning strategy is also evaluated in terms of its contribution to the submission times and grades obtained by the students.
The remainder of this paper is organized as follows: Section 1.1 presents some studies related to the research topic. Section 2 describes the method used in the research, including the research questions, population and sample, learning strategy, and course evaluation. Section 3 presents the results obtained from the quantitative and qualitative analysis of the population and from the Kirkpatrick model. Section 4 is the discussion, and Section 5 draws together the key conclusions.

Related work
Pedagogical strategies improve the teaching-learning process through the use of methods and tools (Lacave et al., 2018). Teachers and researchers use surveys, experiments, and case studies to evaluate the proposed strategies. However, it is necessary to evaluate the effectiveness of the learning process. The Kirkpatrick model defines four levels for this evaluation. The first level, reaction, evaluates students' satisfaction with the learning experiences. It suggests using a questionnaire to evaluate the subject taught, the teaching materials used, and the quality of the teacher (Ruiz & Snoeck, 2018). The second level, learning, evaluates the degree to which participants acquired knowledge and skills during the learning process. It determines whether there is an increase in knowledge and measures whether the course expectations are met (Diaz et al., 2018). The third level, behavior, evaluates how the participants transfer the knowledge and skills acquired in the training process (Galloway, 2005), and the fourth level, results, measures the performance of the institution based on improving quality and reducing costs (Ruiz & Snoeck, 2018).
The literature review found several research papers that use the Kirkpatrick model to evaluate the effectiveness of learning. Some articles implement all the levels proposed in the model, others use three levels, and some include only two. The studies by Ruiz and Alsalamah use the four levels proposed by the Kirkpatrick model. In Ruiz's work, the model is adapted to evaluate learning in scenarios where the teaching is supported by didactic tools. Level 1 evaluates the contributions made by the didactic tool to improve user acceptance in computer-assisted learning environments. At level 2, metrics are defined to survey the student about the content learned during the course and to compare traditional teaching versus teaching with didactic tools. Level 3 evaluates the extent to which a didactic tool supports training; a qualitative metric is proposed in which real-life tasks are simulated. Level 4 evaluates the results of the process based on two metrics: the number of students who pass the subject and the grades obtained (Ruiz & Snoeck, 2018). Alsalamah's study uses the Kirkpatrick model and all the proposed levels to aid the teacher with the assessment of learning outcomes. Metrics and instruments are defined which can identify the strengths and weaknesses of the formative process (Alsalamah & Callinan, 2021).
In the study proposed by (Hamtini, 2008), three levels of the Kirkpatrick model are used to evaluate computer-assisted learning programs. The reaction level measures the ease of use of the learning interface and user satisfaction. A Likert-type survey was conducted to determine the effectiveness of learning in the computer-assisted environment and to ensure the learner is given the opportunity to learn the content. The learning level assesses whether the student has acquired the skills and knowledge required by the course. It uses timed assessments (pre-tests and post-tests) to determine whether there has been significant learning. The results level then evaluates the knowledge acquired by the student and the ability to respond efficiently once the course activities have been completed.
Similarly, in (Diaz et al., 2018) "Marketplace" is defined as a gamification-based proposal that offers rewards to students who give the correct answers in programming quizzes. It uses levels 1 and 2 of the Kirkpatrick model to evaluate the teaching process in programming. For level 1, a survey is designed to evaluate student motivation, and in level 2, a questionnaire based on an official survey instrument from the Universidad de La Frontera is conducted to measure students' knowledge, skills, and attitudes on the course. The results obtained indicate that Marketplace increases student motivation during the learning process, improving academic results.
In the literature review, it was observed that the Kirkpatrick model helps to evaluate students' perceptions of the pedagogical tool, their teacher, and the learning platform used. It can also evaluate submission times for graded tasks and the grades achieved by the student during the learning process. However, few studies have evaluated the effectiveness of activities related to programming courses. For this reason, the current paper uses the Kirkpatrick model in order to contribute to the design and evaluation of CS1 programming courses.

Method
This section describes the research questions that motivated the integration of the flipped classroom strategy and the automatic evaluation of source code in the programming course activities.It also includes the population, the sample selected for testing and the learning strategy implemented.

Research questions
The teaching-learning process requires integrating learning strategies, models, and tools that support both learning and student assessment. In this paper, a learning strategy is implemented in a CS1 programming course in order to answer the following research questions:
RQ1: What is the students' perception of the teacher, the pedagogical tools, and the learning platform used during the course?
RQ2: What are the delivery times and the grades achieved with the use of the learning strategy?
RQ3: How can the effectiveness of programming activities be measured using the Kirkpatrick model?

Population
This research was developed in a Fundamentals of Functional Programming (FP) course, also known as the CS1 programming course. The course is taught in the first semester and is part of the Systems Engineering undergraduate program at Universidad del Valle, Cali, Colombia. It covers the theoretical and practical foundations of the functional programming paradigm in the JavaScript programming language. Students must complete 128 h of work over 16 weeks. Each week, the teacher gives a theoretical and a practical class in blocks of 120 min each. The remaining hours are worked by the student independently.
The course includes learning data and control structures, data relationships, and algorithmic thinking to solve computational problems in a programming language. It also has the highest dropout rate and the lowest grades compared to other courses at the same university.

Sample
Each semester, on average, 560 students enroll in the FP course. To extract the sample, the proposal of (Cantoni, 2009) was used, which includes the population size N, the confidence level Z, the standard deviation S, and the allowed sampling error e (see Eq. 1).
For the research, the population N of 560 students was used, with a confidence level Z = 1.96, a standard deviation S = 0.5, and a permissible sampling error of 10%, that is, e = 0.1. The values were then substituted into the formula. The resulting value of n was 82.10, which indicates that a representative sample for this work is 82 students. Based on this information, the groups for the tests and the proposed learning strategy were defined. The groups were formed randomly from the two course sections of semester 2022-2. The first, called the Experimental Group (EG), includes 56 students; the second, called the Control Group (CG), is made up of 26 students.
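The calculation above can be sketched in a few lines. Since Eq. 1 is not reproduced here, the code assumes the standard finite-population sample-size formula (Cochran's correction), n = N·Z²·S² / ((N−1)·e² + Z²·S²), which reproduces the reported value:

```python
def sample_size(N, Z, S, e):
    """Finite-population sample size (assumed form of Eq. 1):
    n = N*Z^2*S^2 / ((N-1)*e^2 + Z^2*S^2)."""
    return (N * Z**2 * S**2) / ((N - 1) * e**2 + Z**2 * S**2)

# Values reported in the study
n = sample_size(N=560, Z=1.96, S=0.5, e=0.1)
print(n)  # ~82.1, consistent with the 82 students reported
```

Plugging in the study's parameters yields approximately 82.1, matching the representative sample of 82 students.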
Cronbach's alpha was also used to validate the reliability of the data generated in the results. This statistical measure is computed from the pairwise correlations between items in a survey or questionnaire. The generated values can range from a negative value to one. If the consistency of the data is between 0.7 and 0.8, the result is acceptable; if the value obtained is between 0.8 and 0.9, the consistency is good; and if the alpha is greater than 0.9, the consistency is excellent. To carry out the process, a matrix was created with the students as rows and the questions as columns. Then the cronbach_alpha function of the pingouin library was used, which receives the matrix described above as its input parameter.

Learning strategy
The course is based on the submission of programming activities covering the content developed in the semester. To have their programming activities evaluated, students submit their source code through the M-IDEA automatic code evaluation tool (Hidalgo & Bucheli, 2019). Each student must complete 15 programming activities related to compound data types, using the functional programming paradigm. These activities are graded on a scale from 0.0 (lowest) to 5.0 (highest), and the final mark of each student is calculated as the average of the 15 programming activities carried out. In the EG, the learning strategy that combined FC and M-IDEA and integrated three learning activities was used. The first activity consisted of completing a quiz of three theoretical questions and a programming exercise during the first 20 min of each class, after students had reviewed the study material provided by the teacher, which contains guides and short videos with programming concepts and exercises. This activity can be used to assess the students' prior knowledge. The theoretical questions can be multiple choice or true/false. The programming exercise must be written in the JavaScript programming language and submitted through the M-IDEA tool, which evaluates it against the inputs and outputs proposed in the test cases.
In the second activity, the teacher shares the students' quiz results in class, to bring to light any common mistakes or misunderstandings about the evaluated topic. Finally, in the third activity, students develop collaborative programming exercises in class following the teacher's instruction. These exercises are evaluated with the M-IDEA tool and include the concepts and examples of source code proposed in the guides and videos from the first activity.
In the CG, the learning strategy includes the traditional teacher-led lecture and the M-IDEA tool. In both groups (EG and CG), students must develop and submit 15 programming activities through M-IDEA, with a maximum deadline for submission of 34 days. For each activity they must send the source code and wait for evaluation with the respective feedback. If the work submitted is free of errors, the student is awarded the maximum of 100 points; work containing coding errors is awarded 0-99 points accordingly. Students can submit and evaluate their source code as many times as required. The objective is to review the syntax errors identified by the tool and to achieve a pass grade in the activity. During the evaluation process, the teacher clarifies any uncertainties or questions a student has about the proposed activities.
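The grading scheme described above (0-100 M-IDEA points per activity, final mark on a 0.0-5.0 scale) can be sketched as follows. The paper does not state how points map to the 0.0-5.0 grade, so the linear mapping below is an assumption for illustration, as are the example point values:

```python
def activity_grade(points):
    """Map M-IDEA points (0-100) to the 0.0-5.0 grade scale.
    NOTE: a linear mapping is assumed; the paper does not state the exact rule."""
    return round(points / 20.0, 2)

def final_mark(points_per_activity):
    """Final mark: the average of the activity grades (15 activities in the course)."""
    grades = [activity_grade(p) for p in points_per_activity]
    return round(sum(grades) / len(grades), 2)

# Hypothetical submissions for illustration (error-free, minor errors, many errors)
print(final_mark([100, 80, 50]))  # 3.83
```

Under this assumed mapping, an error-free submission (100 points) earns the maximum grade of 5.0.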
Table 1 presents the number of students, the number of programming activities completed, length of time for submission in days, and the learning strategy implemented in the EG and the CG.

Course evaluation
To measure the efficiency and impact of the strategy based on FC and the automatic evaluation of source code, the four levels proposed in the Kirkpatrick model are used. Learning efficiency is evaluated by comparing the time taken to submit the learning evidence and the grades obtained between the EG and the CG.
Reaction level At this level, a survey is applied so that students can evaluate their satisfaction with the teacher, the pedagogical tools, and the learning platform used during the course (see Table 1 for the composition of the EG and the CG).
Learning level At this level, a data collection instrument (questionnaire) was designed based on the 7 steps proposed in (Fontela, 2017). The questionnaire was prepared by 3 teachers of the FP course, with 2 questions for each topic seen in class, except for the last topic, which has 3 questions, resulting in 21 multiple-choice questions with a single answer. Each question focuses on identifying the "know" ability acquired by students. The questionnaire was applied to both the control group and the experimental group, at the beginning of the course as a pre-test and at the end of the course as a post-test.
Behavior and results level After the submission window for the proposed learning evidence closed (34 days), the results obtained in the experiment were analyzed. For this process, the Mann-Whitney statistical test (McKnight & Najab, 2010) was used, which compares the means of the variables measured in the EG and CG against a null hypothesis H0.
In the process, the p-value corresponding to the significance level is obtained. If the value found is less than 0.05, the null hypothesis is rejected, and it is concluded that the means of the EG and CG differ at a significance level of 5%. If the p-value is greater than 0.05, the null hypothesis is not rejected, indicating that the mean values of the two groups do not differ significantly.
In this work, the following variables are used: the time taken by the students to submit the activities and the grades obtained. For these variables, the hypotheses H0 and H1 are defined. The null hypothesis H0 states that the EG and CG means are similar, while H1 states that the means of the two groups differ. MedGE corresponds to the mean of the experimental group and MedGC to the mean of the control group (H0: MedGE = MedGC; H1: MedGE ≠ MedGC). Figure 1 shows the architecture, which integrates the phases from the construction of the activities to the analysis of the results. Students submit their tasks through the M-IDEA platform, sending their exercises as source-code solutions.
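The decision rule described above can be sketched with a minimal Mann-Whitney test. The study itself relied on scipy.stats; the version below implements the U statistic directly with a normal approximation (no tie correction), and the two small samples are hypothetical:

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test using a normal approximation.
    Returns (U1, p-value); ties contribute 0.5 and are not variance-corrected."""
    # U1: for every (a, b) pair, count 1 when the a-value wins, 0.5 on a tie
    u1 = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    n1, n2 = len(a), len(b)
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    # Two-sided p-value from the standard normal CDF
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return u1, p

# Clearly separated groups: H0 rejected (p < 0.05)
u, p = mann_whitney_u([1, 2, 3, 4], [5, 6, 7, 8])
print(u, p < 0.05)   # 0.0 True

# Interleaved groups: H0 not rejected (p > 0.05)
u2, p2 = mann_whitney_u([1, 3, 5], [2, 4, 6])
print(p2 > 0.05)     # True
```

For the study's sample sizes (56 and 26), scipy.stats.mannwhitneyu would be the practical choice; this sketch only illustrates the accept/reject logic at the 5% level.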

Results
In the EG, the FC and M-IDEA learning strategy was used in order to have students prepare before each class. In the CG, the traditional lecture and M-IDEA were used. For the first activity of the strategy, which comprises the theoretical questionnaire and the programming exercise, four categories were analyzed for the EG: the number of students who took the quiz, passed it, did not take it, and did not pass it. For the first two categories, 90.21% of the students took the quiz and 90.07% passed it. This indicates that at least 50 EG students achieved successful reinforcement from the implemented strategy. For the last two categories, the results were 9.79% for students who did not take the quiz and 9.93% for those who did not pass. This means that fewer than 8 students failed to take or pass the FC questionnaires.

Reaction level
In order to respond to RQ1, a survey was completed by the 56 students in the EG and the 26 students in the CG to evaluate their perceptions of the teacher, the pedagogical tools, and the learning platform used during the academic semester. The first item evaluated was the use of materials in class and the explanation of the topics by the teacher. Figure 2 presents the results obtained: the mean for the EG is 4.06 with a standard deviation of 0.788, while for the CG the mean is 4.10 with a standard deviation of 0.785. The second item evaluated was the use of collaborative tools for course development. The evaluation results produced a mean of 4.33 with a standard deviation of 0.777 for the EG, and a mean of 4.17 with a standard deviation of 0.862 for the CG (see Fig. 3).
The last item evaluated was the learning platform used for the training process. Here the mean for the EG was 4.30 with a standard deviation of 0.809, whereas the CG obtained a mean of 3.78 and a standard deviation of 1.166 (see Fig. 4).

Learning level
In order to answer RQ2, a comparison was made of the time taken by students to submit their work for the programming activities and of the grades obtained. Figure 5 compares the time (in days) spent on the submission of the 15 programming activities, with the EG shown in pink and the CG in purple. It can be seen that the CG students achieved a lower median than the EG, which indicates that the CG participants took less time to submit their work. However, further tests are needed to evaluate the quality of the submissions made.
The grades obtained by the students in their exams were analyzed based on statistical calculations and a box-and-whisker plot. In Fig. 6 it can be seen that the EG achieved a higher median than the CG. This indicates that the implemented strategy (FC + M-IDEA) supported students in obtaining better results in their assessed activities. It also suggests that the proposed strategy is effective for activities related to programming. However, further analyses are needed to support these points.
Figure 7 includes the data of the two variables used in the analysis (time taken by students to submit work, and student grades) and presents them in a 3D histogram. The EG is represented by the green bars and the CG by the orange. It can be observed that the submission times in the two groups mostly fall between 10 and 25 days. Although some students used the maximum submission time (34 days), others took just three or four days to complete and submit the activities.
With respect to the grades, it can be seen that the EG students achieved better results than the CG. In most cases, the EG grades are above 3.0, whereas in the CG most grades fall between 1.0 and 3.0. This indicates that there is no linear relationship between the time taken for the activities and the grades obtained. It is also evident that each student takes the time they consider pertinent to complete and submit their work.
Finally, the Mann-Whitney statistical test was performed for the 56 EG and 26 CG students. The procedure used the Python library scipy.stats, whose Mann-Whitney function receives the EG and CG data as input parameters. The result obtained for the submission-time variable was a p-value of 0.3453. Since this p-value is greater than 0.05, the null hypothesis is not rejected, and it is concluded that the difference in submission time between the two groups is not statistically significant.
The same procedure was performed to compare the means of the student grades, with a resulting p-value of 1.025e-09. Since this p-value is less than 0.05, the null hypothesis is rejected, and it can be concluded that the grades differ significantly between the EG and CG students at a significance level of 5%.

Behavior and results level
For RQ3, student knowledge was measured using a pre-test and a post-test. In this process, a quiz with 21 questions on the topics covered in the course and on the programming activities was used. Figure 8 presents the percentage of correct answers obtained by all students for each question. The pre-test results show that the highest percentage obtained was 8%, in Questions 1 and 2. This indicates that at the beginning of the semester the students do not have sufficient knowledge of the topics to be covered in the course. At the end of the semester, the post-test was given with the same questions used in the pre-test. The percentage of correct responses for every question was above 88%. This indicates that, after completing the course content and activities, the students acquired the knowledge required to achieve the learning outcomes of the CS1 programming course.

Discussion
Many computer programming-related courses are held every year, and designing them consumes institutional time and resources. In these circumstances, it is important to evaluate learning strategies to determine their impact on students' needs and whether students achieve the necessary academic skills. It is therefore necessary to follow the teaching process in terms of practical results and student grades; otherwise, the teacher, the academic unit, or even the university may be unduly blamed when a student decides to drop out of a course.
Based on the FC approach and the automatic evaluation tool M-IDEA, a learning strategy is proposed that supports the training and evaluation process of students in a CS1 programming course. Kirkpatrick's model was used to measure the efficiency and impact of the strategy implemented in the programming course. Writing this article required an exhaustive search of the literature to identify the state of the art. It was identified that the works found use different methodologies and tools to support the student's academic process; however, these contributions are validated only through statistical tests such as the z-score or the Mann-Whitney-Wilcoxon test, which validate the results but can bias conclusions about efficiency and effectiveness. In this article, by contrast, the population of 560 students is first identified, from which a representative sample of 82 students is drawn, so that the sampling is sound and the study can be replicated. Then, to evaluate the results, the Kirkpatrick model is used, which measures four aspects: reaction, learning, behavior, and results. For the latter, the Mann-Whitney-Wilcoxon test was used to identify how many students achieved the proposed activities. Finally, to validate the reliability of the data generated in the results, Cronbach's alpha is used. With the proposed strategy, the efficiency of the course was measured from the programming activities, improving the students' final grade in the CS1 programming course by an average of 5%.
On the other hand, Kirkpatrick's model provides techniques for evaluating the evidence from any learning course; the model can be used to determine whether a favorable outcome is limited to the attitudes and/or practices proposed in a course. It also allows the acquisition of learning competencies and the positive impacts on students' abilities and skills to be analyzed. In this paper, it was possible to identify that the implemented strategy is effective in programming activities where autonomous and collaborative work is used, and in strengthening soft skills, since students interact with each other and generate better results compared to traditional work.
From the perspective of the studies reviewed, the strategy proposed here has an impact on aspects that other authors consider relevant: 1) support in identifying strengths and weaknesses in a training course (Hamtini, 2008); 2) use of tools with learning interfaces, human-computer interaction, and feedback (Alsalamah & Callinan, 2021); and 3) use of learning methodologies such as gamification, active learning, and the flipped classroom. All these contributions can help a student gain confidence in learning and improve their programming skills (Diaz et al., 2018).

Conclusion
The application of the levels defined by the Kirkpatrick model enabled factors such as the learning environment, the source code evaluation platform, and the theoretical classes, among others, to be recognized as having an impact on a student's academic process. It can also be highlighted that the participants were given an opportunity to evaluate their knowledge before starting the course (pre-test) and at the end of the semester (post-test). In the post-test evaluation, the students achieved results above 80%. This indicates that each student has their own learning methods and that these were successful for the learning outcomes assessed in the course.
The perception of the students in both groups was acceptable; they reported feeling comfortable with the evaluation tool and the learning strategy. However, students in both groups (EG and CG) recommended some additional learning aids, such as videos or guides, to support them in the programming activities.
With M-IDEA, students can get instant feedback, reducing the time taken for improvement cycles. Likewise, the effort made by the student to solve a problem successfully can be observed. Teachers can also observe where a student fails and how they use the feedback to approach the next iteration of the process. With this information, the teacher can adjust and develop indicators to support new learning strategies in programming courses.
For the development of a strategy based on code evaluation, it is important to integrate dynamic activities for the different topics seen in class. This enables teachers to observe the relationship between student and tool, the student-to-student relationship in collaborative tasks, and the student's relationship with knowledge in general. In this way, learning strategies can be created and timely decisions made to benefit the student. When a student does not achieve the course objectives, alternatives should be developed that can improve the learning process.
For future studies, one proposal is to develop a tool that can generate early intervention in the programming activities sent and evaluated by M-IDEA.Students would acquire valuable feedback and information for the development of source code and the teacher can intervene where the tool identifies problems in a student's learning process.

Ethical issues
Before the questionnaires measuring the level of learning at the beginning (pre-test) and at the end of the course (post-test) were applied, the students of the control and experimental groups approved an informed consent stating that this is an academic exercise and that the confidentiality of the data supplied would be maintained.
The objective of this instrument is to ensure good conduct in the research carried out, as proposed by Petousi and Sifaki in their research (Petousi & Sifaki, 2020). In this way, we contribute to science and society while complying with policies, procedures, and guidelines and avoiding harm.

Limitations of the research
This research has a number of limitations, and it is important to be aware of them in order to evaluate the results adequately. The experiment was limited to 82 students from the Fundamentals of Functional Programming course (period 2020-II), who were divided into the experimental group (EG) with 56 students and the control group (CG) with 26 students. The paper is limited to measuring the effectiveness of the course programming activities using the Kirkpatrick model, analyzing whether submission times and grades improved with the FC strategy, and observing students' perceptions of the teacher, the pedagogical tools, and the learning platform used during the academic semester.

Fig. 1
Fig. 1 Architecture for FC-based and automatic assessment strategy

Fig. 2
Fig. 2 Student rating of materials used in class and teacher's explanation of the topics

Fig. 3
Fig. 3 Student rating of use of collaborative tools for course developments

Fig. 5
Fig. 5 Time taken by students (EG (A) versus CG (B)) for the submission of the scheduled programming activities

Fig. 8
Fig. 8 Results obtained by question in the pre-and post-test programming quiz