Introduction

Over the last decade, the web has changed the way that educational content and learning processes are delivered to students. It constitutes a new means for education and training, which is growing rapidly worldwide, opening new possibilities and offering better, more efficient and more intensive learning processes. Intelligent Tutoring Systems (ITSs) constitute a generation of computer-based educational systems that incorporate intelligence to increase their instructional effectiveness. The main characteristic of ITSs is that they can adapt the educational tasks and learning processes to individual students’ needs in order to maximize their learning. This is mainly accomplished by utilizing Artificial Intelligence methods to represent the pedagogical decisions they make and the knowledge regarding the domain they teach, the learning activities, the students’ characteristics and their assessment (Shute and Zapata-Rivera 2010). Thus, ITSs constitute a popular type of educational system and are becoming a main means of education delivery, leading to impressive improvements in student learning (Aleven et al. 2009; VanLehn 2006; Woolf 2010).

Assessment constitutes a fundamental aspect of ITSs. Its aim is to provide a measure of students’ comprehension and performance and to give both the educational system and the learner a deeper insight into the learner’s knowledge level and gaps (Jeremić et al. 2012; VanLehn 2008). Educational systems are becoming increasingly effective at assessing the knowledge level of students, utilizing systematic assessment and marking methods (Baker and Rossi 2013; Martin and VanLehn 1995; Pavlik et al. 2009). In such systems, assessment mechanisms are vital and can help tutors know how well the students have understood various concepts and monitor students’ performance and class learning progress. More accurate assessments can lead to tutoring that is more adaptive to individual students and thus to more effective learning (Siler and VanLehn 2003). Assessments can determine how well students are learning on a continuous basis and assist in taking the necessary corrective actions as soon as possible to improve student learning (Mehta and Schlecht 1998). Therefore, systematic assessment and marking mechanisms should be an integral part of any e-learning system (Kwan et al. 2004). In general, assessing students’ performance via their answers to exercises is a complex and time-consuming activity that takes up valuable time tutors could devote to other educational tasks. Moreover, manual assessment, even for a small class, cannot guarantee that tutor feedback will be as prompt as in one-to-one tutoring (Ihantola et al. 2010). Manual assessment of students’ performance on exercises can delay the delivery of feedback by tutors to students for days or even weeks. So, in some cases, tutors may even have to reduce the number of assignments given to their students due to lack of time. Especially in large-scale courses, accurate and meaningful assessment is a very demanding task for tutors. Also, accuracy is usually difficult to achieve, due to both subjective and objective factors. Automatic assessment can ensure consistency in students’ assessment, since all exercises are evaluated based on exactly the same criteria, and all assessments and marks awarded can be explained to the students instantly and, most of all, thoroughly and in detail (Suleman 2008). Therefore, the creation of mechanisms for automatic assessment is quite desirable.

Automatic assessment systems can assist tutors in evaluating students’ work and also enable more regular and prompt feedback (Barker-Plummer et al. 2008; Charman and Elmes 1998; Gouli et al. 2006). It is commonly acknowledged by tutors that students’ learning is enhanced by frequent assessment and proper feedback (Shepard 2005). When students learn in educational systems, timely feedback is essential, and automated marking mechanisms make it possible to provide feedback on students’ work in progress (Falkner et al. 2014).

One of the major challenges that teachers face in computer science courses is the difficulty associated with teaching programming and algorithms, which are considered difficult domains for students (Jenkins 2002; Lahtinen et al. 2005; Watson and Li 2014). The Artificial Intelligence (AI) course is an important course in the computer science discipline. Among the fundamental topics of the curriculum of an AI course is “search algorithms”, including blind and heuristic search algorithms. It is vital for students to get a strong understanding of the way search algorithms work and of their application to various problems. In general, search algorithms are complicated and many students have particular difficulties in understanding and applying them.

Usually, in an AI course, the tutor creates and hands out a set of assignments asking the students to provide their hand-made solutions. Then, the tutor has to mark all students’ answers, present the correct ones and discuss common errors. This process is time-consuming for the tutor, particularly when the number of answers is large. On the other hand, educational systems with graphical and interactive web-based tools are more appealing to students than the traditional way of doing exercises (Naps et al. 2002; Olson and Wisher 2002; Sitzmann et al. 2006). Therefore, we developed the Artificial Intelligence Teaching System (AITS), an ITS that can assist tutors in teaching and students in learning about, among other topics, search algorithms (Grivokostopoulou and Hatzilygeroudis 2013a). It supports study of their theoretical aspects, provides visualizations demonstrating the way that different algorithms function and also assists students in applying the algorithms via various interactive exercises and learning scenarios. Furthermore, an automatic assessment mechanism has been developed and integrated into AITS, which can help tutors reduce the time spent on marking and use this time for more creative tasks and personal contact with the students. Also, during students’ practice with interactive exercises, the automatic assessment mechanism instantly evaluates students’ actions and provides meaningful feedback. With the use of automatic assessment, all students’ answers are assessed in a consistent manner, and students get their marks and feedback immediately after the submission of their answers.

The contributions of this paper are as follows. First, it introduces the use of interactive step-based visualizations (in other words, visualized animations) of algorithmic operations in teaching and learning concepts of search algorithms in the context of an ITS. The aim is to achieve more effective learning by actively involving students in interactive visualizations through interactive exercises. As far as we are aware, there are no other similar efforts. Second, it introduces an automatic assessment mechanism for assessing students’ performance on exercises related to search algorithms. The mechanism takes into account the similarity between a student’s answer and the correct one, the type of the answer in terms of completeness and accuracy, as well as possible carelessness or inattention errors. The aim is to obtain a consistent and reliable assessment mechanism that assures more effective feedback. As far as we are aware, there are no other efforts that provide such a systematic assessment approach (based on a similarity measure, systematic error categorization, systematic answer categorization and an automated marker) that can generalize to other domains. Both contributions are validated via experiments. For the first contribution, a pre-test/post-test and experimental/control group approach has been used. For the second, linear regression and classification metrics have been used.

The rest of the paper is organized as follows: Section 2 presents related work on educational systems for teaching algorithms and also on automated assessment methodologies and tools that have been developed. Section 3 presents the Artificial Intelligence Teaching System (AITS), illustrating its architecture and analyzing its functionality. Section 4 presents the automatic assessment mechanism, describes the way it analyzes and evaluates students’ answers and presents the provided feedback. Section 5 presents the experimental studies conducted and discusses the results. Finally, Section 6 concludes the paper and provides directions for future work.

Related Work

Many research efforts and educational systems have been developed to assist teaching and learning in the domain of algorithms. PATHFINDER (Sánchez-Torrubia et al. 2009) is a system developed to assist students in actively learning Dijkstra’s algorithm. The highlight of that tool is the animated algorithm visualization panel. It shows, on the code, the current step the student is executing and also where the user has made a mistake while running the algorithm. TRAKLA2 (Malmi et al. 2004) is a system for automatically assessing visual algorithm simulation exercises. TRAKLA2 provides automatic feedback and grading and allows resubmission, which means that students can correct their errors in real time. In (Kordaki et al. 2008), an educational system called Starting with Algorithmic Structures (SAS), designed for teaching concepts of algorithms and basic algorithmic structures to beginners, is presented. Students come across the implementation of algorithms in real-life scenarios and the system offers feedback to correct their answers. In the work presented in (Lau and Yuen 2010), the authors examine whether gender and learning styles can be used to associate mental models in learning a sorting algorithm. The results indicate that the mental models of females are more similar to the expert referent structure and that concrete learners have a higher similarity in their mental models with the expert ones than abstract learners. Those findings can be utilized in designing educational processes and learning activities for assisting students in learning algorithms. In a previous work of ours (Grivokostopoulou et al. 2014a), aspects of an educational system, used in the context of an AI course, are presented and the AI techniques used for adapting learning to students are described. Although most of the above efforts offer some kind of visualization, none of them, except TRAKLA2, provides any mechanism for automatic assessment.

Also, recently, there have been research studies and systems that support automatic assessment of students in various domains. Marking of students’ answers to exercises, and hence assessment of their performance, is necessary to scaffold an effective learning process in an educational system. Also, combined with the delivery of appropriate formative feedback, it can enhance a student’s knowledge construction (Clark 2012; Nicol and Macfarlane-Dick 2006; Heffernan and Heffernan 2014). Assessment is necessary to update a student’s model and characteristics, specify the student’s knowledge level and trace misconceptions and knowledge gaps for both individual students and the class. Indeed, studies from cognitive and educational psychology indicate correlations between self-assessment and learning outcomes, pointing out that students who monitor their own learning more accurately and promptly tend to have better learning outcomes (Chi et al. 1989; Long and Aleven 2013; Winne and Hadwin 1998). However, automatic assessment in general is considered to be domain dependent, which means that it necessitates knowledge of the domain’s main principles, concepts and constraints. So, a challenging research direction is the specification of a general framework for automated assessment. In this paper, we have made a step in this direction (see the section on the automatic marking mechanism and Fig. 7).

The field where automatic assessment is most widely used is computer science, and especially computer programming (Douce et al. 2005; Alemán 2011; Ala-Mutka 2005). There are various systems that include a mechanism for automated assessment, to mark student programming exercises and provide feedback, such as ASSYST (Jackson and Usher 1997), BOSS (Joy et al. 2005), GAME (Blumenstein et al. 2008), CourseMarker (Higgins et al. 2005), AutoLEP (Wang et al. 2011) and Autograder (Helmick 2007). ASSYST utilizes a scheme that analyzes students’ programming submissions across a number of criteria and determines whether submissions are correct by comparing the operation of a program against a set of predefined test data. The analysis also aims to determine a program’s efficiency and whether it has sensible metric scores corresponding to complexity and style. CourseMarker evaluates students’ programming assignments in various programming languages, such as C and C++. It automatically provides feedback to students and reports to instructors regarding students’ performance. Its basic marking method relies on typographic, feature, and dynamic execution tests of the students’ exercises. CourseMarker also supports the formative aspects of assessment, allowing students to have their program graded at frequent intervals prior to submission. To make this feasible, the profile of the program is constructed by measuring its attributes and its functionality in order to arrive at a grade. AutoLEP is a learning environment for the C programming language that integrates a grading mechanism to mark students’ programs, combining static analysis with dynamic testing to analyze those programs. It utilizes the similarity of students’ and teacher’s solutions and also provides feedback regarding compiler errors and failed test cases. To do so, it creates a graph representation of a student’s program and compares it with a set of correct model programs. BOSS supports the assessment of student exercises by collecting submissions, performing automatic tests for correctness and quality, checking for plagiarism, and providing an interface for marking and delivering feedback. It provides automatic assessment and also assists the lecturer in achieving a higher degree of accuracy and consistency in marking. It also provides administrative and archiving functionalities. In general, BOSS is conceived as a summative assessment tool and, although it supports feedback for students, its primary function is to assist in the process of accurate assessment. QuizPACK (Brusilovsky and Sosnovsky 2005) is a good example of a system that assesses program evaluation skills. QuizPACK generates parameterized exercises for the C language and automatically evaluates the correctness of student answers. For the assessment, QuizPACK utilizes simple code transformations to convert a student’s exercise code into a function that takes the parameter value and returns the value to be checked. It also provides guidance to users and assists them in selecting the most useful exercises in order to advance their learning goals and level of knowledge. QuizJET (Hsiao et al. 2008) is a system for teaching the Java programming language that supports automatic assessment of parameterized online quizzes and questions.

In (Higgins and Bligh 2006), an approach to conducting formative assessment of student coursework within diagram-based domains using Computer Based Assessment (CBA) technology is presented. Also, in (Thomas et al. 2008), an automatic marking tool for learning and assessing graph-based diagrams, such as Entity-Relationship Diagrams (ERDs) and Unified Modeling Language (UML) diagrams, is presented. It identifies the minimal meaningful units of the diagrams, automatically marks student answers based on their similarity to the correct answer and also provides dynamically created feedback to guide students.

Graph similarity methods are quite often utilized to analyze and assess exercises and student answers. In (Naudé et al. 2010), graph similarity measures are used to assess program source code directly, by comparing the structural similarity between a student’s submission and already marked solutions, relying on the principle that similar vertices should have similar neighbors. In (Barker-Plummer et al. 2012), edit distance is used as an approach to analyze and characterize the errors in student formalization exercises. The authors report that edit distance is quite promising for examining the type and the nature of student errors. In (Stajduhar and Mausa 2015), the authors mark student SQL exercises and statements by comparing the similarity of students’ SQL statements with reference statement pairs, utilizing methods such as Euclidean and Levenshtein word distance. The obtained results show that string metrics are quite promising, given that they contribute to the overall predictive accuracy of the assessment method. Also, in (Vujošević-Janičić et al. 2013), tools for objective and reliable automated grading in introductory programming courses are presented, where substantial and comprehensible feedback is provided too. The authors present two methods that can be used for improving automated evaluation of students’ programs. The first is based on software verification and the second on control flow graph (CFG) similarity measurement. Both methods can be used for providing feedback to students and for improving automated grading for teachers. The authors report quite interesting results regarding the performance of the automatic assessment tools. Although the above efforts use the notion of edit distance to assess the similarity between the student and the correct answer and to characterize errors, they do not take carelessness errors into account and do not do it in a systematic way, as we do. Also, in WADEIn II (Brusilovsky and Loboda 2006), which is a web-based visualization tool for the C language, adaptive visualization and textual explanations are utilized in order to portray the process of expression evaluation.

In the context of the AI course at our university, we have developed mechanisms that automatically mark exercises related to symbolic logic. AutoMark-NLtoFOL (Perikos et al. 2012) is a web-based system that automatically marks student answers to exercises on converting natural language sentences into First Order Logic (FOL) formulas. AutoMark-NLtoFOL provides students with an environment for practicing and assessing their performance on converting natural language sentences into FOL and also for improving their performance through personalized feedback. In (Grivokostopoulou et al. 2012), a system that automatically marks students’ answers to exercises on converting FOL to clause form (CF) is presented. Both marking approaches utilize a domain error categorization to detect errors in students’ answers, mark them and provide proper feedback.

However, as far as we are aware, there are no works in the literature that have developed mechanisms or methodologies to assess students’ answers to interactive exercises on search algorithms, apart from two works of ours. In (Grivokostopoulou and Hatzilygeroudis 2013b, c) two methods for the automated assessment of student answers to exercises related to search algorithms are presented. The methods and the tools developed can assist the tutors in their assessment tasks and also provide immediate feedback to students concerning their performance and the errors made.

Artificial Intelligence Teaching System (AITS)

Artificial Intelligence Teaching System (AITS) is an intelligent tutoring system that we have developed in our department for helping students in learning and tutors in teaching AI topics, a basic one being ‘search algorithms’. The architecture of the system is illustrated in Fig. 1. It consists of six main units: Student Interface, Tutor Interface, Automatic Assessment, Test Generator, Learning Analytics and the Domain Knowledge & Learning Objects.

Fig. 1 An overview of AITS architecture

During a student’s interaction with the system, e.g. while dealing with an interactive exercise or a test, his/her answer(s) is (are) forwarded to the Automatic Assessment unit. The Automatic Assessment unit consists of three main parts: the Error Detection Mechanism, the Automatic Marking Mechanism and the Feedback Mechanism. The error detection mechanism is used to analyze the student’s answers, detect the errors made and characterize the student’s answers in terms of completeness and accuracy. After that, it interacts with the automatic marking mechanism, which is used to calculate the mark for each of the student’s answers to an exercise and also to specify the student’s overall score on a test. The feedback mechanism is used to provide immediate and meaningful feedback to the student regarding the score achieved and the errors made on individual exercises or a test.

A test in AITS is generated in a user-adapted mode by the Test Generator unit. The Test Generator unit utilizes a rule-based expert system for making decisions on the difficulty level of the exercises to be included in the test (Hatzilygeroudis et al. 2006), so that the test is adapted to the knowledge level and needs of the student. Created tests consist of a number of exercises that examine different aspects of (blind and/or heuristic) search algorithms.

A tutor can also connect to and interact with the system through the Tutor Interface. The tutor can manage the educational content and the learning activities in the system, add new exercises and examples and also edit or even delete existing ones. For this purpose, an exercise generation tool (Grivokostopoulou and Hatzilygeroudis 2015) has been developed and embedded into AITS, aiming to assist tutors in creating new exercises in a semi-automatic way. The Learning Analytics unit aims at assisting the tutor in monitoring students’ activities and supervising their learning performance and progress. The Learning Analytics unit provides tutors with general information regarding a student’s learning progress and shows statistics and common errors that students make. Finally, the Domain Knowledge and Learning Objects unit represents concepts related to search algorithms and their relations in a concise way, through an ontology.

Domain Knowledge and Learning Objects

The domain knowledge structure concerns AI curriculum concepts, related to a number of AI subjects. The AITS system covers four main subjects: Knowledge Representation & Reasoning, Search Algorithms, Constraint Satisfaction and Planning. The domain knowledge is structured in a tree-like way. The root of a tree is a main subject (e.g. Search Algorithms, Constraint Satisfaction etc.). A main subject is divided into topics and each topic into concepts. In this way, each subject deals with a number of topics and each topic with a number of concepts. The number of topics depends on the subject; for example, the subject Knowledge Representation and Reasoning consists of three main topics (Propositional Logic, Predicate Logic and Reasoning) and many concepts. The knowledge tree is displayed in the navigation area, at the left-hand side of the student interface. A student should specify a concept for studying (following a ‘subject-topic-concept’ path). As soon as this is done, corresponding learning objects, distinguished into theory, examples and exercises, are presented to the student.

The Search Algorithms subject consists of two main topics, heuristic search and blind search, and each one consists of a number of concepts. In Fig. 2, a part of the tree structure for the Search Algorithms subject, concerning the path to the heuristic function concept, is presented. The ‘theory’ objects consist of text presenting theoretical knowledge about the corresponding concept. The ‘examples and visualizations’ objects are visual presentations/animations of search algorithm operations, which, in line with the theory, aim to improve the students’ comprehension of related concepts. The ‘interactive exercises’ are exercises that students try to solve, giving their answers in a step-based interactive way. There are two main types of interactive exercises: practice exercises and assessment exercises. The practice exercises are interactive exercises that are equipped with hints and assistance during the learning sessions, aiming to provide guidance and help to the students. On the other hand, the assessment exercises are used to examine the students’ progress and comprehension of the corresponding concepts. The assessment of students’ answers to exercises can be useful for both the students and the system. The system can get a deeper insight into each individual student’s knowledge level, skills and performance and adapt the learning activities and the topics for study to the student’s learning needs. Also, from the students’ perspective, self-assessment can help students trace their gaps and identify concepts needing further study. Finally, a ‘test’ object consists of a set of exercises that a student is called to solve. The answers are assessed and marked automatically by the system.

Fig. 2 A part of the domain modeling of the course curriculum

Learning Analytics

The Learning Analytics unit records, analyses and visualizes learning data related to students’ activities, aiming to help the tutor get a deeper understanding of the students’ learning progress on search algorithms. More specifically, it provides information regarding each individual student’s overall performance and performance on specific topics, concepts or exercises. Also, it disaggregates student performance according to selected characteristics, such as major, year of study, etc. Furthermore, it gives feedback to the tutor about the knowledge level and grades (with the help of the automatic marking mechanism) of each student, the lowest/highest performance on each exercise, the most frequent errors made, the exercises that have been attempted, the time taken for each assessment exercise and for the test, the total time spent in the system, etc. Moreover, it uses data mining techniques to make predictions regarding students’ performance (Grivokostopoulou et al. 2014b). Additionally, it provides information related to the assessment exercises, like the number of hints requested for each exercise, the student’s performance after the delivery of a hint and more. These statistics can assist the tutor in tracing concepts of particular difficulty for the students and also assist both the tutor and the system in getting a more complete insight into each student and in better adapting the learning activities to each student’s needs and performance.

Learning Approach and Activities

A student first chooses a concept from the domain hierarchy for studying. Then, for the chosen concept, theory is presented first and afterwards examples and visualizations are provided to the student. The student, after “having played” with them, is called to deal with some interactive practice exercises, which aim to give him/her (a) a better understanding of what he/she has learnt so far, since practice exercises convey hints and guidance for any difficulties met, and (b) the opportunity to check whether he/she can apply what he/she has learnt. In this spirit, some assessment exercises can be offered to the student, where he/she experiences a pre-test state and the system obtains a first evaluation of the student’s knowledge level. Finally, a test is formed, consisting of a number of exercises of various difficulty levels and complexity, based on the student’s history in dealing with practice and pre-test assessment exercises. In this way, a student with a weaker performance history than another student will get an easier test than the other student. The results of the test may give a final assessment of the knowledge level of the student regarding the corresponding concept(s). The student is not forced to follow the system’s way of teaching, but can make his/her own choices for studying a concept. Any student who has finished studying a particular concept can take the corresponding concept-level test and continue studying the next concept(s).

Visualization of Algorithms

In the learning scenarios offered by the system, a student can study the theoretical aspects of an algorithm alongside appropriate explanations and algorithm visualizations of various example cases. Algorithm visualizations/animations are well suited to assisting students in learning algorithms (Hundhausen et al. 2002). Visualizations, when properly used in a learning process, can help a student understand more deeply the way that an algorithm operates, by demonstrating how it works and how it makes proper decisions based on parameters such as heuristic and cost functions (Hansen et al. 2002; Naps et al. 2002). Therefore, there are many algorithm visualization tools for educational purposes, such as Jeliot 3 (Moreno et al. 2004) and ViLLE (Rajala et al. 2007).

Jeliot is a visualization tool for novice students to learn procedural and object oriented programming. The provided animations present step by step executions of Java programs. ViLLE is a language-independent program visualization tool that provides a more abstract view of programming. It offers an environment for students to study the execution of example programs, thus supporting the learning process of novice programmers.

In our system, during the visualization of an algorithm, every decision the algorithm makes, such as which node(s) to expand/visit, is properly presented and explained to the student. The system explains how a decision was made by the algorithm and how the values of parameters, such as the heuristic and the cost functions (if any), were calculated for each algorithm’s step.

A notable aspect of our algorithm visualizations is that they have been developed according to the principles of active learning. They have been designed to engage students as much as possible in the demonstration process and to make them think hard at every step of an algorithm’s animation. The principles of active learning postulate that the more the users directly manipulate and act upon the learning material, the higher the mental effort and psychological involvement and therefore the better the learning outcome. In this spirit, during an animation demonstrating the application of an algorithm to a case scenario, the system can stop at a random step and ask the student to specify some aspects regarding the operation of the algorithm. The animation may engage the student and request him/her to specify the next action to be made or ask him/her to justify why an action was made. In general, such justifications mainly concern either the last action(s) conducted by the algorithm or the specification and proper justification of the next action to be conducted. The interactions with the student are mainly based on multiple choice questions, where the student has to specify the correct answer. For example, during a visualization the system can pause and ask the student to specify the algorithm’s next step by selecting the proper answer to a multiple choice question. In the case of a correct answer, it can also ask the student to justify it, via additional multiple choice question(s). In the case of an erroneous answer, the correct response and proper explanations are immediately offered to the student. After an interaction with the learner, the animation process continues. So, during an algorithm’s visualization in an example exercise scenario, multiple interactions with the learner can take place. In Fig. 3, the explanation of the functionality of A* on an example exercise via step-by-step animations is illustrated.

Fig. 3 Visualization of the operation of A* on an example case

Interactive Exercises

The system provides two types of interactive exercises: practice exercises and assessment exercises. The practice interactive exercises provide help and immediate feedback after a student’s incorrect action, because the main objective is to help the student learn. In practice exercises, students are requested to apply an algorithm according to a specific exercise scenario, which can concern the specification of the whole sequence of nodes (starting from the root of the tree), a specific sub-part of it (starting from an intermediate node and consisting of some steps of an algorithm) or the specification of the next node(s)/step(s) of the algorithm. In Fig. 4, a practice exercise related to the breadth-first search (BFS) algorithm, the corresponding feedback error report and the given mark are presented.

Fig. 4 A practice exercise on the breadth-first search algorithm

Furthermore, the system provides interactive assessment exercises that are used to examine the student’s progress and comprehension. They can be used by the students themselves to measure their knowledge and understanding and also by the system to get a deeper understanding of students’ skills and interests and to provide personalized instruction, tailored to students’ learning needs (Aleven et al. 2010; Rosa and Eskenazi 2013). The system does not provide any feedback during the student’s interaction with assessment exercises; feedback is provided after a student has submitted an answer. In Fig. 5, an assessment exercise on the A* algorithm and the corresponding marks are presented.

Fig. 5 An assessment exercise on the A* algorithm

Feedback Provision

Another aspect of the system is that it provides meaningful and immediate feedback to students. Several research studies have shown that the most important factor for students learning in an educational system is feedback (Darus 2009; Hattie and Timperley 2007; Narciss 2008).

Human tutoring is considered to be extremely effective at providing hints and guiding students’ learning. A key aspect of tutoring is that the tutor does not reveal the solution directly, but tries to engage students in more active learning and thinking. Furthermore, research studies indicate that computer-generated hints can assist students to a degree similar to a human tutor (Muñoz-Merino et al. 2011; VanLehn 2011). In this spirit, the system aims to make students think hard so that they finally manage to give a correct answer without unnecessary or unasked-for hints. The system never gives unsolicited hints to students. If a student’s answer is incorrect, proper feedback messages are constructed and made available via the help button. The student can get those messages on demand by clicking on the help button.

The feedback mechanism in AITS is used to provide immediate and meaningful feedback to students during learning sessions. With regard to practice exercises, feedback can be provided before a student submits an answer and also after the student has submitted an incorrect answer. So, a student can ask for help while working on an exercise and before he/she submits an answer. An answer to an interactive exercise mainly depends on the exercise’s type and characteristics and in general concerns the specification of the steps of an algorithm trying to solve a problem. This may refer to the specification of all of the steps, part of them or even a single step. The system’s assistance before the student submits an answer can remind the student of the corresponding concepts involved in the exercise and also orient the student’s attention to specific elements of the current step of the algorithm that may be tricky and thus lead to high error rates and failed attempts. In this spirit, the system can provide hints to the student regarding the algorithm’s functionality in the current exercise’s conditions. This kind of assistance and the corresponding help messages are provided on demand, after a student’s request for assistance.

On the other hand, for the assessment exercises, feedback is provided only after the student submits an answer. The system provides feedback about the correctness of the students’ answers as well as information about the types of the errors made and their possible causes. Feedback information is provided at two levels. At the first level, after a student submits a test, the system informs him/her about the correctness of the answers to each involved exercise. Then, the system recognizes the errors made and provides proper feedback to the student. Also, it provides the grade (mark) achieved by the student, as estimated by the automatic assessment mechanism. In Fig. 6, the feedback provided to a student for an exercise of a test is presented.

Fig. 6 Feedback to a student’s answer

Error Detection Mechanism

The error detection mechanism is used to recognize the errors made by a student and interacts with the feedback mechanism unit to provide input for creating feedback as well as with the automatic marking mechanism to give input for calculation of the mark.

Similarity Measure

An answer is represented by the sequence of the nodes visited by an algorithm while trying to solve a problem. In terms of data type, an answer is represented by a string. Calculation of the mark of an answer to an exercise is based on the similarity of the student answer to the correct answer, i.e. on the similarity of two strings. The error detection mechanism analyzes student answers and estimates the similarity between the sequence of a student’s answer (SA) and the sequence of the corresponding correct answer (CA). The similarity between SA and CA is calculated using the edit distance metric (Levenshtein 1966).

The edit distance between SA and CA, denoted by d(SA,CA), is defined as the minimal cost of transforming SA into CA, using three basic operations: insertion, deletion and relabeling. We define the three basic edit operations, node insertion, node deletion and node relabeling, as follows:

  • node insertion: insert a new node in a sequence of SA.

  • node deletion: delete a node from a sequence of SA.

  • node relabeling: change the label of a node in SA.

The node insertion and node relabeling have the following three special cases:

  • Second_Node_Insertion (SCI): insert the second node in a sequence.

  • Second_Node_Relabeling (SNR): change the label of the second node in a sequence.

  • Goal_Node_Relabeling (GNR): change the label of the goal node (last node) in a sequence.

The rationale behind the above distinctions is that we are interested in distinguishing the cases where an error is made in selecting the second node in a sequence (actually the first choice of the student) or the goal node (actually the last choice). This is because they are considered as cases of more serious errors than others.

Given an SA and the corresponding CA, their similarity is determined by the edit distance, considering that the cost of each basic operation on the nodes (insertion, deletion, relabeling) is 1. Also, we consider the cost of applying one of the special case operations (SCI, SNR, GNR) to be 2. The above choices are based on results of experimental studies.

As a first example, let’s consider the following: SA = <A B D E M H > and CA = <A C D E Z H>. In this case, in order to make them match, we have to apply relabeling twice, one to node B, to make it C, and the other to node M, to make it Z. Using the above cost scheme, the cost of relabeling B to C is 2, because it’s a special case (SNR), while the cost of relabeling M to Z is 1, because it’s a basic case. Thus, the edit distance of the two sequences is d(SA,CA) = 2 + 1 = 3.

As a second example, consider the following two sequences: SA = <A(0,5) C(1,4) D(2,3) G(3,4) F(3,4) K(0,7)> and CA = <A(0,5) C(1,4) D(2,3) M(3,2) F(3,4)>, related to an A* exercise. To have an (exact) match between the two sequences, we have to apply a relabeling of G(3,4) to M(3,2) and a deletion of node K(0,7). Relabeling node G to M has a cost of 1, as does deleting node K. Thus, the edit distance of the two sequences is d(SA,CA) = 1 + 1 = 2.
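
To make the cost scheme concrete, the following is a minimal Python sketch of the weighted edit distance described above. It assumes that the relabeling special cases refer to the second and the last (goal) node of the student's sequence and that the insertion special case refers to the second node of the correct sequence; the function and variable names are ours, not part of AITS.

```python
def edit_distance(sa, ca):
    """Weighted edit distance between a student answer (sa) and the correct
    answer (ca), both given as lists of node labels.  Basic operations
    (insertion, deletion, relabeling) cost 1; the special cases (errors on
    the second node or on the goal/last node) cost 2."""
    def relabel_cost(i):
        # relabeling the second node (SNR) or the goal node (GNR) of SA costs 2
        return 2 if i == 1 or i == len(sa) - 1 else 1

    def insert_cost(j):
        # inserting the missing second node of the sequence (SCI) costs 2
        return 2 if j == 1 else 1

    n, m = len(sa), len(ca)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + 1                    # delete sa[i-1]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + insert_cost(j - 1)   # insert ca[j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if sa[i - 1] == ca[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j - 1] + relabel_cost(i - 1),   # relabel
                              d[i - 1][j] + 1,                         # delete
                              d[i][j - 1] + insert_cost(j - 1))        # insert
    return d[n][m]

# First example: SNR (cost 2) plus one plain relabeling (cost 1) -> 3
print(edit_distance(list("ABDEMH"), list("ACDEZH")))
# Second example: one plain relabeling plus one deletion -> 2
print(edit_distance(["A(0,5)", "C(1,4)", "D(2,3)", "G(3,4)", "F(3,4)", "K(0,7)"],
                    ["A(0,5)", "C(1,4)", "D(2,3)", "M(3,2)", "F(3,4)"]))
```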

Categorizing Student Answers

The categorization of students’ answers is made for better modeling and assessment. To this end, a student’s answer is characterized in terms of completeness and accuracy. An answer is considered complete if all nodes of the correct answer appear in the student’s answer; otherwise it is incomplete or superfluous. An answer is accurate when all its nodes are correct; otherwise it is inaccurate. Based on these, we have categorized answers into five categories. The categorization is influenced by the scheme proposed in (Fiedler and Tsovaltzi 2003); however, it is enriched in terms of superfluity (Gouli et al. 2006). The categories of student answers are the following:

  • IncompleteAccurate (IncAcc): All present nodes are correct, but they are a subset of the required and the edit distance is greater than 0.

  • IncompleteInaccurate (IncIna): Fewer nodes than required are present, some of them are incorrect, and the edit distance is greater than 0.

  • CompleteInaccurate (ComIna): Only the required number of nodes are present, but the edit distance is greater than 0.

  • CompleteAccurate (ComAcc): Only the required number of nodes are present and the edit distance is equal to 0. So, the student answer is correct.

  • Superfluous (SF): Present nodes are more than the required; also, edit distance is greater than 0. We distinguish two further cases:

  • SF-LG: Last node is the goal node-state

  • SF-LNG: Last node is not the goal node-state (the student continues after having reached the goal node or fails to reach it).

In case a student’s answer is characterized as complete and accurate, the student has given the (or a) correct answer. In all other cases, the student’s answer is considered incorrect and the student has to work towards a correct one. The student’s answer is further analyzed by the error detection unit, in order to recognize the types of the errors made.

The two example student sequences used above are categorized as follows: the first example as ComIna and the second as SF-LNG.
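
The categorization can be written down directly from the definitions above. The following Python fragment is our reading of the scheme and complements the earlier sketch; the helper name, the goal parameter and the string labels are ours.

```python
def categorize(sa, ca, goal, dist):
    """Categorize a student answer (sa) against the correct answer (ca),
    given the goal node label and the edit distance dist = d(SA, CA)."""
    if dist == 0:
        return "ComAcc"                                  # complete and accurate
    if len(sa) > len(ca):                                # more nodes than required
        return "SF-LG" if sa and sa[-1] == goal else "SF-LNG"
    if len(sa) < len(ca):                                # fewer nodes than required
        return "IncAcc" if all(n in ca for n in sa) else "IncIna"
    return "ComIna"                                      # required length, but errors

# First example above: equal length, distance 3 -> ComIna
print(categorize(list("ABDEMH"), list("ACDEZH"), "H", 3))
```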

Importance of Errors

An important factor in answer assessment is estimating the importance of the errors made by a student. Not all errors that a student makes are of the same importance. Also, not all of them are due to lack of knowledge or understanding of the algorithms. Some of them are due to carelessness or inattention on the part of the student. Carelessness can be defined as giving the wrong answer despite having the skills needed to answer correctly (Hershkovitz et al. 2013). In educational systems, carelessness is not uncommon, even among high-performing students (San et al. 2011), so students often make careless errors. Even highly engaged students may paradoxically become overconfident or impulsive, a fact that leads to careless errors (San Pedro et al. 2014). In this spirit, an important aspect of modeling student answers is to take such carelessness into account too. To this end, we consider that the existence of only one error in an answer may be due to inattention.

So, we distinguish the following two types of single-error answers:

  1. Type I: Only one operation, either regular or of the special cases, is required in order to achieve matching.

  2. Type II: Only one relabeling operation for two consecutive nodes-states (that have been switched with each other) is required in order to achieve matching.

Notice that Type I represents answers that have a missing node or an extra node or a node that needs to be changed. Type II represents answers that have two nodes in incorrect positions.
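
A simple way to detect these two single-error types, consistent with the descriptions above (one missing, extra or mislabeled node for Type I; one adjacent swap for Type II), is sketched below; the function name and return values are ours.

```python
def single_error_type(sa, ca):
    """Return "I", "II" or None for a student answer sa vs correct answer ca."""
    if len(sa) == len(ca):
        diffs = [i for i in range(len(sa)) if sa[i] != ca[i]]
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and sa[diffs[0]] == ca[diffs[1]] and sa[diffs[1]] == ca[diffs[0]]):
            return "II"                      # two consecutive nodes switched
        if len(diffs) == 1:
            return "I"                       # a single node needs relabeling
    if len(sa) == len(ca) + 1:               # a single extra node in the answer
        if any(sa[:i] + sa[i + 1:] == ca for i in range(len(sa))):
            return "I"
    if len(sa) + 1 == len(ca):               # a single missing node in the answer
        if any(ca[:i] + ca[i + 1:] == sa for i in range(len(ca))):
            return "I"
    return None

print(single_error_type(list("ABDC"), list("ABCD")))   # -> II (C and D switched)
print(single_error_type(list("ABC"), list("ABCD")))    # -> I  (one missing node)
```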

Automatic Marking Mechanism

In this section, we present the automatic marking mechanism, which is used to mark students’ answers to exercises. To accomplish this, the automatic marking mechanism interacts with the error detection mechanism. We distinguish two types of exercise-answers as far as marking is concerned: simple and complex. Exercises whose correct answers include at most 6 nodes-states are considered simple exercise-answers. The rest are considered complex. This distinction is based on empirical data.

Simple Exercise-Answer Marking

Initially, the mechanism calculates the edit distance between the sequence of a simple student answer (SA_i) and that of the correct answer (CA_i).

Then the mark, called student score (SS_i), of the simple answer SA_i is calculated by the following equation:

$$ SS_i = \begin{cases} maxscore \times \left(1 - \dfrac{d(SA_i, CA_i)}{n_c}\right), & \text{if } d(SA_i, CA_i) < n_c \\ 0, & \text{otherwise} \end{cases} $$
(1)

where d(SA_i, CA_i) is the edit distance, n_c represents the number of nodes-states of CA_i and maxscore represents the marking scale. In the case of a complex answer, the student score is calculated by the automated marking mechanism presented below.

Given a test including q interactive exercises, the test score is calculated as the average score of the answers:

$$ Test_{Score} = \frac{\sum_{i=1}^{q} SS_i}{q} $$
(2)
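
As a quick illustration, formulas (1) and (2) translate directly into code; the sketch below takes the edit distance and n_c as inputs (the function names are ours).

```python
def simple_score(d, n_c, maxscore=100):
    """Formula (1): mark of a simple answer from its weighted edit distance d
    to the correct answer and the number n_c of nodes-states in that answer."""
    return maxscore * (1 - d / n_c) if d < n_c else 0

def test_score(scores):
    """Formula (2): a test's score is the average of its exercise scores."""
    return sum(scores) / len(scores)

# First worked example above: d(SA, CA) = 3 and n_c = 6 -> 50.0
print(simple_score(3, 6))
print(test_score([50.0, 100.0]))   # a two-exercise test -> 75.0
```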

Automated Marking Algorithm

Marking students’ answers, as mentioned above, is a complex and very demanding process for tutors, especially in cases of complex exercise-answers. Automated marking in AITS is based on the type of the student answer and the type of errors made, determined by the error detection mechanism. So, the marking mechanism tries to model and estimate the overall student understanding of the functionality of the algorithms. The marking algorithm presented below is based on empirical estimation and evaluation results as well as the principle of simulating tutor marking.

1. If simple, SS_i is calculated via formula (1)

2. If complex

  2.1 If the student answer (SA_i) is of type ComIna

    2.1.1 If SA_i is a single-error answer of Type I,
          SS_i = maxscore * (1 - (w_1 * d(SA_i, CA_i)) / n_c)

    2.1.2 If SA_i is a single-error answer of Type II,
          SS_i = maxscore * (1 - (w_2 * d(SA_i, CA_i)) / n_c)

    2.1.3 Otherwise, SS_i is calculated via formula (1)

  2.2 If the student answer (SA_i) is of type IncAcc

    2.2.1 If SA_i is a single-error answer of Type I,
          SS_i = maxscore * (1 - (w_1 * d(SA_i, CA_i)) / n_c)

    2.2.2 Otherwise, SS_i is calculated via formula (1)

  2.3 If the student answer (SA_i) is of type IncIna

    2.3.1 SS_i is calculated via formula (1)

  2.4 If the student answer (SA_i) is of type Superfluous, case SF-LG

    2.4.1 If SA_i is a single-error answer of Type I,
          SS_i = maxscore * (1 - w_1 / n_c)

    2.4.2 If SA_i is a single-error answer of Type II,
          SS_i = maxscore * (1 - (w_2 * d(SA_i, CA_i)) / n_c)

    2.4.3 Otherwise, SS_i is calculated via formula (1)

  2.5 If the student answer (SA_i) is of type Superfluous, case SF-LNG

    2.5.1 SS_i is calculated via formula (1)

  2.6 If the student answer (SA_i) is of type ComAcc, then SS_i = maxscore

where we set the parameters as follows: w_1 = 0.6 and w_2 = 0.8. Parameters w_1 and w_2 represent the importance of the errors in calculating the mark and are empirically set based on two tutors’ experience. Also, recall that, based on empirical data, we consider that for a sequence with fewer than 7 nodes it does not make sense to take into account the type of the answer and of the errors made. In the context of our work, maxscore is set to 100 and the automated marking algorithm marks exercises on a 0 to 100 point scale.
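
The case analysis above can be tied together with the earlier sketches (edit_distance, categorize, single_error_type and simple_score). The following is a sketch of our reading of the algorithm, not the AITS implementation, and all names are ours.

```python
W1, W2 = 0.6, 0.8        # empirically set weights for Type I / Type II errors
MAXSCORE = 100
SIMPLE_LIMIT = 6         # correct answers with at most 6 nodes are "simple"

def mark(sa, ca, goal):
    """Mark a student answer sa against the correct answer ca (goal is the
    label of the goal node), reusing the helpers sketched earlier."""
    d = edit_distance(sa, ca)
    n_c = len(ca)
    if n_c <= SIMPLE_LIMIT:                                # step 1: simple answers
        return simple_score(d, n_c, MAXSCORE)
    category = categorize(sa, ca, goal, d)                 # step 2: complex answers
    err = single_error_type(sa, ca)
    if category == "ComAcc":                               # 2.6
        return MAXSCORE
    if category in ("ComIna", "IncAcc") and err == "I":    # 2.1.1 / 2.2.1
        return MAXSCORE * (1 - (W1 * d) / n_c)
    if category == "ComIna" and err == "II":               # 2.1.2
        return MAXSCORE * (1 - (W2 * d) / n_c)
    if category == "SF-LG" and err == "I":                 # 2.4.1
        return MAXSCORE * (1 - W1 / n_c)
    if category == "SF-LG" and err == "II":                # 2.4.2
        return MAXSCORE * (1 - (W2 * d) / n_c)
    return simple_score(d, n_c, MAXSCORE)                  # all other cases: formula (1)
```

On the worked examples that follow, this sketch reproduces the reported behaviour: the superfluous answers fall back to formula (1), while the ComIna answer with a Type II single error gets 100 × (1 − 0.8·2/7) ≈ 77.14.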

Furthermore, the mechanism can identify and take into account cases where the student does not have an adequate understanding of the algorithm and has given an answer in an inconsistent and, in some sense, random way. Special attention is paid to the cases where, particularly in blind search algorithms, the student has specified just some nodes correctly, but has not understood the way the algorithm functions.

As a first example, consider the following case, where the initial state is node A and the correct answer for the Depth-First Search algorithm is CA = <A B K E W Z M>. The student answer is the following: SA = <A B K E L D M H F>. Initially, the error detection mechanism estimates the edit distance between SA and CA. The required operations are: two node relabelings (L ➔ W and D ➔ Z) and two node deletions (H and F). The cost for node relabeling and node deletion is 1. So, the edit distance is d(SA,CA) = 2*1 + 2*1 = 4. Also, the error detection mechanism detects the student’s answer as being superfluous, more specifically of SF-LNG type, since the student continued after reaching the goal node M. Thus, according to the marking algorithm, SS = 100 - 100*(4/7) = 42.86.

As a second example, consider the following case, where the initial state is node A and the correct answer for a Hill Climbing search interactive exercise is CA = <A(8) K(5) M(3) L(2) D(2) C(1) N(0)>. The answer of a student is the following: SA = <A(8) B(6) F(8) M(3) L(2) D(2) C(1) N(0)>. Initially, the error detection mechanism estimates the edit distance between the student answer (SA) and the correct answer (CA). The required operations are: a second_node_relabeling (B(6) ➔ K(5)) and a node deletion (F(8)). The cost for second_node_relabeling is 2, while the cost for the node deletion operation is 1. So, the edit distance is d(SA,CA) = 2 + 1 = 3. Also, the error detection mechanism detects the student’s answer as SF-LG. Thus, according to the marking algorithm, SS = 100 - 100*(3/7) = 57.14.

As a third example, consider the following case, where the initial state is node A and the correct answer for a Best First Search interactive exercise is CA = <A(8) M(6) L(5) C(4) D(3) G(2) I(0)>. The student answer is the following: SA = <A(8) M(6) L(5) D(3) C(4) G(2) I(0)>. The required operations to achieve matching are two relabelings (D(3) ➔ C(4) and C(4) ➔ D(3)). The cost for node relabeling is 1. So, the edit distance is d(SA,CA) = 1 + 1 = 2. Also, the error detection mechanism detects the student’s answer as ComIna and as a single-error answer of Type II. Thus, according to the marking algorithm, SS = 100 - 100*((0.8*2)/7) = 77.14.

Finally, consider the following exercise for the Breadth-First search algorithm, as illustrated in Fig. 7. The correct answer (CA) is the following: <A B C D E F G H I J K >.

Fig. 7 An interactive exercise for the Breadth-First Search algorithm

In Table 1, student answers to the above exercise, the categorization of each answer and the corresponding score calculated by the automated marking algorithm are presented.

Table 1 Examples of Students’ Answers

General Assessment Framework

In an effort to specify a general framework for the assessment process, which could be used in other domains too, we ended up with the diagram of Fig. 8. It consists of four stages:

  1. Specify the similarity between the student answer and the correct answer

  2. Categorize the student answer according to the answer categorization scheme

  3. Check for important errors

  4. Calculate the mark via the automated marker

Fig. 8 General framework of the assessment process

Indeed, the resulting framework can be used in different domains by proper instantiation of the following elements:

  • Similarity metric

  • Answer categorization scheme

  • Automated marking algorithm

The above elements, as implemented in AITS, could easily be reused for domains where the answers are strings or can be transformed into strings.
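
For instance, the four stages, instantiated with the string-based components sketched in the previous sections, could be wired together as follows (a sketch; the function and field names are ours).

```python
def assess(sa, ca, goal):
    """Four-stage assessment pipeline of Fig. 8, instantiated for answers
    that are sequences of node labels."""
    d = edit_distance(sa, ca)                  # 1. similarity between SA and CA
    category = categorize(sa, ca, goal, d)     # 2. answer categorization
    err = single_error_type(sa, ca)            # 3. check for important errors
    score = mark(sa, ca, goal)                 # 4. automated marker
    return {"distance": d, "category": category, "single_error": err, "score": score}
```

Porting the framework to another domain would then amount to swapping in a domain-specific similarity metric, categorization scheme and marker while keeping the same pipeline.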

Evaluation

Experimental studies were conducted to evaluate AITS and the assessment mechanism during learning. The main objective of the experimental studies was to evaluate the effectiveness of AITS and also the performance of the automatic assessment mechanism. For this purpose, two different experiments were designed and implemented in the context of the AI course in our department. In order to explore how the system assists students in learning about search algorithms, an extended evaluation study using the pre-test/post-test and experimental/control group method was conducted in real class conditions. Furthermore, to evaluate the performance of the assessment system, we compared the assessment mechanism against an expert tutor in the domain of search algorithms.

Evaluation of the Assessment Mechanism

In this section, we describe the experiments conducted and the resulting findings regarding the automatic assessment mechanism. The automatic assessment mechanism has been used to assess students’ performance while they study with AITS.

Experiment Design

Initially, for the needs of the study, we randomly selected 80 undergraduate students (both female and male) of our department who had attended the Artificial Intelligence course, had used AITS and had taken exercises and tests related to blind and heuristic search algorithms. During a student’s interaction with AITS, all of the student’s learning actions and submitted answers were recorded and archived by the system. For the study, students’ answers were collected and a corpus of 400 answers that the students had provided to various search algorithm exercises was formed. After that, three tutors with many years of experience in teaching search algorithms were asked to mark the corpus of students’ answers. The tutors jointly evaluated each student answer and marked it on a 100-point scale. Those tutors’ marks were used as a gold standard. The tutors’ marking process was completed before the automatic marking process was used to assess the student answers, so the scores of the automatic marker were not known to the tutors, eliminating any possible bias towards the automatic marker’s scores. After that, the automatic marker was used to assess each answer in the corpus. So, the students’ answers were assessed by both the tutors and the automated marker, and thus two datasets were formed, one with the tutors’ marks and the other with the system’s marks.

Results

After the formulation of the two datasets, we calculated the Pearson correlation between the marks (scores) of the tutor and the automated marker. The results showed that there was a very strong, positive correlation between the tutor’s marks and those of the automated marking system, which was statistically significant (r = .96, n = 400, p < .0005).

After that, a linear regression approach was applied to the dataset of students’ marks. Linear regression attempts to model the relationship between two variables, one independent and one dependent. We consider the variable representing the automated marker’s scores as independent and the variable representing the tutor’s marks as dependent. The objective of the regression is to offer us a way to assess the quality of the automated marker; it does not predict the students’ marks. Initially, we checked that the data are appropriate and meet the assumptions required for linear regression to give a valid result. For example, looking at the scatter plot (Fig. 9), we can see that the conditions are met: the two variables (automated marker, tutor marker) are continuous, there is a linear relationship between them and there are no significant outliers.

Fig. 9 Scatter plot of automated marker vs tutor marker

A simple linear regression applied to the two corpuses of student marks resulted in the following equation: y = 9.697 + 0.894x with r² = .934. The results of the linear regression model indicated that there was a strong positive relationship (Dancey and Reidy 2007) between the automated marker and the human marker, since r = .967, while r² = .934 suggests that 93.4 % of the total variation in y can be explained by the linear relationship between x and y. This means that the regression line explains approximately 93.4 % of the variability in the data corpus. The model was a good fit for the data (F = 5656, p < .0005).
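
For reference, statistics of this kind can be reproduced with standard tooling; the snippet below is a minimal sketch using SciPy, with a few placeholder score pairs standing in for the 400 tutor/system marks (the variable names and values are illustrative only).

```python
from scipy import stats

# Placeholder paired scores (automated marker x, tutor marker y)
system_marks = [42.86, 57.14, 77.14, 100.0, 28.57, 85.71, 64.0, 92.0]
tutor_marks  = [45.0,  60.0,  80.0,  100.0, 30.0,  90.0,  65.0, 95.0]

r, p = stats.pearsonr(system_marks, tutor_marks)
reg = stats.linregress(system_marks, tutor_marks)
print(f"Pearson r = {r:.3f} (p = {p:.4f})")
print(f"y = {reg.intercept:.3f} + {reg.slope:.3f}x, r^2 = {reg.rvalue ** 2:.3f}")
```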

Then, we created three corpuses out of the corpus of 400 student answers, namely CorpusA, CorpusB and CorpusC. CorpusA consisted of the student answers whose assessment levels were ‘very low’ and ‘low’, CorpusB of those with levels ‘medium’ and ‘good’, and CorpusC of those with level ‘excellent’. We characterize an answer assessment as ‘very low’ if the score specified by the tutor is up to 25, ‘low’ if the score is from 26 to 45, ‘medium’ if it is from 46 to 65, ‘good’ if it is from 66 to 85 and ‘excellent’ if the score is greater than 85. The purpose of creating the three groups was to examine the evaluation of answer assessment separately at the low, medium and excellent levels. Table 2 presents the correlations between the scores of the human marker and the automated marker for the three corpuses. The correlation for CorpusB was .896, higher than those of the other two corpuses.

Table 2 Results of correlation for three corpuses

Moreover, in Fig. 10 the scatter plots resulting from the three groups are presented. In fact, Fig. 10b shows a better agreement between tutor and system assessments for the medium level answers than for the others, since the data concentrate more in the vicinity of the corresponding line.

Fig. 10 Scatter plots for (a) CorpusA, (b) CorpusB, (c) CorpusC

A second experiment was designed to evaluate the automated assessment system, considering assessment as a classification problem. In this respect, we use appropriate metrics to evaluate the system's assessment in comparison with the tutor's assessment. We analyze the corpus of the 400 student answers and discretize it into the above-mentioned 'very low', 'low', 'medium', 'good' and 'excellent' mark categories.

The evaluation is mainly based on three well-known metrics: average accuracy, precision and F-measure, given that the output is multi-class, consisting of five classes. Average accuracy is the mean value of the accuracies of the output classes, and the F-measure is defined as:

$$ F\text{-}measure = 2 \times \frac{precision \times recall}{precision + recall} $$
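
These metrics can be computed from the confusion matrix of tutor categories versus system categories. The sketch below is an assumed implementation using scikit-learn and the hypothetical arrays and mark_category helper introduced above, with macro averaging over the five classes.

```python
# Minimal sketch: average accuracy, macro-averaged precision and F-measure
# for the five mark categories, using scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, f1_score

labels = ["very low", "low", "medium", "good", "excellent"]
y_tutor = [mark_category(s) for s in tutor_marks]    # hypothetical arrays
y_system = [mark_category(s) for s in system_marks]

cm = confusion_matrix(y_tutor, y_system, labels=labels)
n = cm.sum()
# Per-class accuracy = (TP + TN) / n; average accuracy is their mean.
per_class_acc = [(2 * cm[i, i] + n - cm[i, :].sum() - cm[:, i].sum()) / n
                 for i in range(len(labels))]
print("average accuracy:", np.mean(per_class_acc))
print("precision (macro):",
      precision_score(y_tutor, y_system, labels=labels, average="macro"))
print("F-measure (macro):",
      f1_score(y_tutor, y_system, labels=labels, average="macro"))
```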

Also, we evaluate the agreement between the tutor and the automated marker using Cohen’s Kappa statistic (Cohen 1960), which is defined as follows:

$$ \kappa = \frac{p_0 - p_e}{1 - p_e} $$

where p_0 is the observed proportion of agreement between the raters and p_e is the proportion of agreement expected by chance alone. Thus, perfect agreement is indicated by κ = 1, while κ = 0 means agreement no better than chance. Cohen's Kappa was estimated to determine the degree of agreement between the grades of the tutor and the automated marker on the 400 students' answers. The results indicate substantial agreement (Viera and Garrett 2005) between the tutor and the automated marker, since κ = .707 (95 % confidence interval, p < .0005). Also, the performance of the automated assessment mechanism and the confusion matrix are presented in Tables 3 and 4 respectively.
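
Cohen's Kappa can be computed directly from the confusion matrix built above via its definition; the following sketch is a plain implementation (not the paper's code), and the same value can be obtained with sklearn.metrics.cohen_kappa_score.

```python
# Minimal sketch: Cohen's Kappa from its definition, using the confusion
# matrix cm built in the previous sketch.
import numpy as np

def cohens_kappa(cm: np.ndarray) -> float:
    n = cm.sum()
    p0 = np.trace(cm) / n                                   # observed agreement
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # chance agreement
    return (p0 - pe) / (1.0 - pe)

print("kappa:", cohens_kappa(cm))
# Equivalently: sklearn.metrics.cohen_kappa_score(y_tutor, y_system)
```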

Table 3 Evaluation results of automated assessment mechanism
Table 4 Confusion matrix of Automatic Assessment Performance

The results indicate that the automated assessment mechanism has an encouraging performance. Of the 400 student answers marked by the automated marking mechanism, 332 were assigned to the correct mark category. This means that in approximately 83 % of the cases the automated marking mechanism correctly estimated the mark category of the student answer. The analysis of the confusion matrix shows that the automated assessment system's and the tutor's marks have much in common but do not match exactly; the tutor in most cases assigned a higher score than the system. This is because automated marking cannot always determine whether a student has deeply understood an algorithm and adapt its marking to the individual student. A tutor, however, can recognize whether a student has understood the algorithm despite his/her errors and reflect this in the marking of the answers.

Evaluation of AITS Learning Effectiveness

Method

We conducted an evaluation study in which we compared teaching/learning with AITS versus the traditional way. The purpose of the study was to evaluate the effectiveness of learning in these two different settings. The participants were 300 undergraduate students (both female and male) from the Artificial Intelligence (AI) classes at our department. All students were in the 4th year of their studies and ranged in age from 21 to 24 years (M = 22.5). A pre-test/post-test experiment was used. So, we created two groups; the first consisted of 150 students (70 female and 80 male) of the class of academic year 2011–2012, denoted by ClassA (control group), and the second consisted of 150 students (73 female and 77 male) of the class of academic year 2012–2013, denoted by ClassB (experimental group). The participants in each group were randomly chosen from a total of about 250 students in each year.

ClassA (control group) did not use AITS, but instead followed the traditional learning/teaching process for search algorithms: the students attended lectures and videos on AI search algorithms, solved exercises given by the tutor individually and then discussed them with the tutor. ClassB (experimental group) was given access to AITS to study AI search algorithms; the system provided different types of exercises and feedback during the students' interaction with it.

The experiment consisted of four phases: pre-test, learning phase, post-test and questionnaire, as illustrated in Fig. 11. The two groups followed the same procedure: both groups were given a pre-test, then ClassA learned about search algorithms in the traditional way whereas ClassB learned through AITS, and afterwards both groups were given a post-test. After the post-test, all participants were given a questionnaire to fill in. The pre-tests and post-tests were isomorphic and incorporated structurally equivalent exercises on search algorithms.

Fig. 11 Structure of the experiment

Results

In order to analyze the students' performance, an independent t-test was used on the pre-test scores. We considered the null hypothesis that there is no difference between the performances of the students of ClassA and ClassB on the pre-test. The mean value and standard deviation of the pre-test were 39.4 and 10.74 for ClassA (M = 39.4, SD = 10.74) and 38.9 and 11.0 for ClassB (M = 38.9, SD = 11.0) respectively. Also, t = .398, p = .691 (p > .05) and the effect size was d = .046 (Cohen 1988). So, it can be inferred that the two classes did not differ significantly prior to the experiment: ClassA and ClassB were at almost the same knowledge level regarding search algorithm concepts before starting the learning process. Table 5 presents the means and standard errors of the pre-test and post-test scores for the two groups (control vs experimental).
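
The pre-test comparison can be reproduced with an independent-samples t-test plus a pooled-standard-deviation Cohen's d; the sketch below assumes the pre-test scores of the two groups are available as plain arrays (the file names are hypothetical).

```python
# Minimal sketch: independent t-test on the pre-test scores of ClassA and
# ClassB, plus Cohen's d with a pooled standard deviation.
import numpy as np
from scipy.stats import ttest_ind

pre_a = np.loadtxt("pretest_classA.csv")  # hypothetical file names
pre_b = np.loadtxt("pretest_classB.csv")

t_stat, p_value = ttest_ind(pre_a, pre_b)
pooled_sd = np.sqrt(((len(pre_a) - 1) * pre_a.var(ddof=1) +
                     (len(pre_b) - 1) * pre_b.var(ddof=1)) /
                    (len(pre_a) + len(pre_b) - 2))
d = (pre_a.mean() - pre_b.mean()) / pooled_sd
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, d = {d:.3f}")
```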

Table 5 Results of performance of pre-test/post-test for each group

The results show that the mean of ClassB increased from 38.9 on the pre-test to 67.54 on the post-test, whereas the mean of ClassA increased from 39.4 to 49.98. So, the post-test mean of ClassB was considerably higher than that of ClassA.

To determine the effectiveness of learning, we conducted a repeated-measures ANOVA to examine the difference between the two conditions (control and experimental). The analysis of variance (ANOVA) was conducted with Test (pre-test, post-test) as a repeated factor and Group (ClassA, ClassB) as a between-subjects factor. The results revealed a significant difference in learning performance between the conditions, F(1, 298) = 235.319, p < .001, with an effect size of .44. Also, Fig. 12 presents how each group performed on the pre-test and post-test, with each line representing a group.

In addition, we calculated the simple learning gains as posttest − pretest. An ANOVA performed on the simple learning gains showed significant differences between the conditions, F(1, 298) = 235.319, p < .001, MSError = 103.87. We also calculated the normalized learning gain as follows:

$$ \frac{posttest - pretest}{1 - pretest} $$
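
A minimal sketch of these gain computations, assuming hypothetical pre-test/post-test score arrays rescaled to the unit interval (so that 1 is the maximum attainable score, as the denominator of the formula implies), is:

```python
# Minimal sketch: simple and normalized learning gains per student,
# following the formulas in the text.
import numpy as np

def learning_gains(pretest: np.ndarray, posttest: np.ndarray):
    simple = posttest - pretest
    normalized = (posttest - pretest) / (1.0 - pretest)  # scores in [0, 1]
    return simple, normalized

# Hypothetical usage: 0-100 scores divided by 100 to map onto [0, 1].
# simple_b, normalized_b = learning_gains(pre_b / 100.0, post_b / 100.0)
```
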
Fig. 12 The means of the pre-test and post-test for each group

An ANOVA performed on the normalized gains also showed significant differences between the conditions, F(1, 298) = 91.28, p < .001, MSError = 282.83. Overall, the results showed that the performance of the students of ClassB, who interacted with AITS, was better than that of the students of ClassA.

Survey Questionnaire

After the post-test, all participants (ClassA and ClassB) were asked to fill in a questionnaire regarding their experiences and opinions about the system's learning impact and the automated assessment mechanism. The questionnaire for ClassB consisted of 12 questions: ten required answers on a five-point Likert scale (1 = strongly disagree to 5 = strongly agree) and two were open-ended. The open-ended questions were provided at the end of the questionnaire to allow students to write their comments about AITS and the automated marking mechanism, stating their experiences and opinions. Table 6 presents the means and standard deviations of the responses of the students of ClassB.

Table 6 Results of the Questionnaire

The results of the questionnaire are very encouraging for learning with AITS and for marking with the automated marker. After analyzing the students' responses, the reliability of the questionnaire was checked using Cronbach's alpha (Cronbach 1951). The reliability of the scale was good, with an internal consistency coefficient of α = .78 for the students of ClassB.
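
Cronbach's alpha can be computed from its standard definition; the sketch below assumes the Likert responses of ClassB are arranged in a hypothetical students × items matrix (150 × 10).

```python
# Minimal sketch: Cronbach's alpha for the ten Likert-scale items.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: n_students x n_items matrix of Likert responses."""
    k = items.shape[1]                              # number of items
    item_var_sum = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)       # variance of the summed scale
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# e.g. alpha = cronbach_alpha(responses_classB)  # hypothetical 150 x 10 matrix
```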

Additionally, a survey was conducted to evaluate the quality of the automated marker. The same three tutors as above, all with many years of experience in teaching search algorithms, were given 10 students' answers that had been marked by the automated assessment mechanism and were asked to rate the quality of the marks on a scale from 0 to 5, this time independently of each other. The rating results provided by the tutors are presented in Fig. 13.

Fig. 13 Tutors' ratings of the quality of the marking system

The average rating scores provided by the three tutors were M = 4.38 for tutor1, M = 4.47 for tutor2 and M = 4.39 for tutor3, giving an overall average of 4.41. These ratings indicate that the tutors found the automated marking mechanism to be appropriately helpful.

Conclusion and Future Work

AITS is an adaptive and intelligent tutoring system for assisting students in learning and tutors in teaching aspects of the artificial intelligence curriculum, one of them being "search algorithms". The system offers theory descriptions but, most importantly, interactive examples and exercises related to search algorithms. A student can study the theory and the examples, which use visualized animations to present AI search algorithms step by step, making them more understandable and attractive. The system also provides interactive exercises that aim to assist students in learning to apply the algorithms in a step-by-step interactive manner: students are called to apply an algorithm to an example case by specifying the algorithm's steps interactively, with the system's guidance and help. It also provides immediate feedback for the interactive exercises and the tests. To our knowledge, this is the first time that such technologies and methods have been used in teaching/learning about search algorithms.

In the context of AITS, we introduce an automatic assessment mechanism to assess the students' answers. Automatic assessment is achieved in a number of stages. First, the system calculates the similarity between a student's answer and the correct answer using the 'edit distance' metric. Afterwards, it identifies the type of the answer, based on its completeness and accuracy as well as on careless errors. Finally, it automatically marks the answer, based on the answer's type, the edit distance and the type of errors, via the automated marking algorithm. So, marking is not based on a clear-cut right-wrong distinction, but on partial correctness, and it takes carelessness or inattention into account. In this way, accuracy and consistency are achieved to a large degree, by avoiding the subjectivity of human marking. Again, this appears to be the first effort to specify a systematic categorization of student answers that takes into account, apart from correctness and consistency, carelessness and inattention as well. Additionally, it is the first time that an automated assessment process is introduced for exercises on search algorithms. On the other hand, the introduced process constitutes an adequately general assessment framework that could be applied to other domains too. Furthermore, the automated marking algorithm itself could be used as the basis for marking answers to exercises in other domains, given that they are expressed as strings.
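
As an illustration of the first stage only, the sketch below computes a plain Levenshtein edit distance between a student's answer and the correct answer, both represented as strings; the actual cost scheme, answer encoding and error typing used in AITS are not reproduced here.

```python
# Minimal sketch: Levenshtein edit distance between two answer strings,
# using the classic dynamic-programming formulation with unit costs.
def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution (0/1)
            prev = cur
    return dp[-1]

# e.g. edit_distance("ABDCE", "ABCDE") == 2 (two node labels swapped)
```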

We conducted two experiments to evaluate (a) the performance of the automated assessment mechanism and (b) the effect on learning of using interactive exercises with visualized step-based animations for search algorithms in AITS. In the first experiment, to evaluate the performance of the automated assessment, a data set of 400 student answers, marked by the system and jointly by three expert tutors, was used as a test bed. The experimental results, analyzed via linear regression and classification metrics, showed a very good agreement between the automatic assessment mechanism and the expert tutors. So, the automatic assessment mechanism can be used as a reference (i.e. accurate and consistent) grading system. In the second experiment, we evaluated the learning effectiveness of AITS through a pre-test/post-test, experimental/control group approach. The results gathered from the evaluation study are very promising: the experimental group performed clearly better than the control group. So, it seems that visualized animations and interactivity are two crucial factors that contribute to better learning, at least for subjects like search algorithms.

At the moment, the implementation of the interactive examples and visualizations is quite time-consuming. So, a way of semi-automatically or automatically generating such learning objects (actually programs) is an interesting direction for further research. On the other hand, the assessment mechanism could be further improved in a number of directions. One is the investigation of more criteria, such as graph connectivity, for the specification of special error cases. Another is the assessment of errors based on user modeling, so as to involve domain knowledge. Finally, an interesting direction would be to test the assessment mechanism on other types of algorithms. Exploring these aspects is a key part of our future work.