1 Introduction

University education is essential for the advancement of any society and plays a major role in realizing comprehensive and sustainable development goals. There is therefore a pressing need to advance university education, especially in light of contemporary challenges and the heightened competition among universities to have their institutions and programs accredited by the National Authority for Quality Assurance and Accreditation of Education (NAQAAE). NAQAAE identified eleven standards, including the standard of education and learning, whose components indicate the importance of assessment to education quality. Moreover, the excellence of educational programs rests on assessment practice (Köksal & Ulum, 2018). From this perspective, there is increasing interest in educational evaluation processes in general and student assessment in particular. Despite the multiplicity and diversity of the tools and methods used to evaluate student performance and cognitive levels, the written exam paper prepared by academics is still the key tool and the approach most widely used by higher education institutions (Kumara et al., 2019; Ndirangu et al.; Omar et al., 2012; Sangodiah et al., 2022). The exam paper plays a significant and effective part in learning, as it is one of the dimensions of student evaluation (Köksal & Ulum, 2018) and is used to identify student learning curves and measure student learning outcomes in the cognitive domain (Ndirangu et al.; Omar et al., 2012). Exam papers are therefore directly linked to the evaluation of graduate quality, being the main criterion for ensuring the education quality of the students brought out by institutions (Timakova & Bakon, 2018). Hence, the quality of an exam paper is crucial to fulfilling the purpose of the assessment (Saha, 2021). Accordingly, it is extremely important to prepare and design exam papers according to quality standards, with a focus on two main aspects: i) the formal specifications of the exam paper and ii) the content specifications of the exam paper, in terms of its representation and coverage of the different levels of knowledge (Chirumamilla et al., 2020; Malinka et al., 2023). In this regard, Bloom's Taxonomy, established by Benjamin Bloom in the 1950s, has been widely used in educational settings to measure, evaluate, and write effective, high-quality, and balanced exams that cover the different cognitive levels, assess different student skills, and are highly consistent with the intended learning outcomes (Alammary, 2021; Das et al., 2020; Mohammed & Omar, 2018). Bloom's taxonomy includes six cognitive levels arranged in a hierarchy. The first three levels refer to the lower-order thinking skills (LOTS), which include knowledge, comprehension, and application, whereas the next three levels refer to the higher-order thinking skills (HOTS), which comprise analysis, synthesis, and evaluation (Köksal & Ulum, 2018; Kumara et al., 2019; Osman & Yahya, 2016).

Academics must consider BCLs when generating exams, to facilitate standardization and to ensure that the exam achieves a proper balance between HOTS and LOTS (Alammary, 2021), which Bloom proposed to be distributed as follows: knowledge (45%), comprehension (10%), application (20%), analysis (10%), synthesis (10%), and evaluation (5%). However, academics still face a major challenge in this context. One of the main issues is how to distinguish the cognitive level addressed by each question. Moreover, some academics lack the background to distinguish between BCLs (Mohammed & Omar, 2018), which may lead them to set questions that require recall and rarely set questions that require reasoning (Yahya et al., 2013), eventually resulting in poor exam quality (Mohammed & Omar, 2018). Hence, there is an urgent need to evaluate exam papers and verify how well they meet the standards of exam quality. It is worth noting that exam papers are still evaluated manually, which can be a tedious, time-consuming, and challenging routine for academic staff (Timakova & Bakon, 2018). This practice does not guarantee correctness and lacks consistency. Moreover, academics often focus mainly on evaluating only the formal specifications of exam papers. Motivated by the importance of categorizing exam questions into BCLs as another major dimension of exam paper assessment, this paper presents an attempt to automate this process in an integrated manner using AI techniques. Recently, researchers have tended to use Machine Learning (ML) techniques to classify questions automatically (Mohamed et al., 2019), as ML improves the performance of question classification (Yahya et al., 2013). All of the factors mentioned above reflect the importance of the problem domain selected for this research. Therefore, this paper develops a novel intelligent system based on an ML algorithm, Logistic Regression (LR), together with image processing techniques, to evaluate exam papers scientifically and accurately in light of exam paper quality specifications. The proposed system can be an effective alternative that overcomes the issues and challenges associated with the manual evaluation of exam papers. It also helps academics modify their exam questions according to Bloom's standard percentages to obtain a balanced, high-quality exam.

The rest of this paper is structured as follows. Section 2 reviews the related work. Section 3 introduces the proposed system. Sections 4 and 5 discuss the experimental results of the proposed and comparison models, as well as their management implications. Section 6 concludes the paper and recommends future work.

2 Related work

As introduced above, there is an urgent need to automate the evaluation of exam papers' quality in terms of both form and content, to verify whether the exam questions cover the different cognitive objectives according to the BCLs. In this regard, several studies have focused on classifying questions automatically according to BCLs. However, to the best of the researchers' knowledge, no single study has investigated automating the evaluation of exam papers in terms of formal and content specifications together, to replace the manual method practiced by academics. The current study attempts to fill this research gap and can therefore be considered a main contribution to this specific research field. Regarding automatic question classification, systems have been developed to classify exam questions based on BCLs using various techniques and approaches: some researchers used a rule-based approach, while others used machine learning approaches. Kumara et al. (2019) proposed a rule-based model to categorize computer science exam questions according to Bloom's Taxonomy. They used NLP techniques to identify the significant keywords and verbs that are useful in determining the suitable cognitive level. The combination of the developed rules and Bloom's Taxonomy showed that the developed model categorized the questions well. Other studies (Alammary, 2021; Jayakodi et al., 2016) proposed a method to classify exam questions into BCLs using the rule-based approach. The authors used WordNet and cosine similarity to develop a rule-based classifier, applying NLP techniques such as tokenization, stop-word removal, lemmatization, and tagging before generating the rule set. The results indicated an accuracy of 70%. Many studies indicated that ML-based question classification overcomes the limitations of rule-based question classification. The rule-based approach takes a long time to build many rules manually. Moreover, although rule-based techniques may achieve good performance on specific datasets, they are likely to perform quite poorly on a new dataset, and consequently, they are difficult to scale. Conversely, the machine learning approach provides a relatively easier way to categorize questions: instead of writing rules manually, it learns from a particular training set and can therefore be adapted to new data with minimal effort. Its overall performance is also better and more accurate than the rule-based approach (Osadi et al., 2017; Razzaghnoori et al., 2018). Accordingly, some researchers have used machine learning methods to categorize exam questions based on BCLs. Sangodiah et al. (2022) applied two ML-based classifiers (Support Vector Machine (SVM) and Naïve Bayes) with feature engineering techniques (TF, TF-IDF, and normalized TF-IDF) to categorize exam questions according to Bloom's Taxonomy. Their study achieved an accuracy of 73.3%. Osadi et al. (2017) developed a model for classifying programming questions using an ensemble classifier technique that combines four different algorithms: a rule-based classifier, the support vector machine, k-nearest neighbors, and Naïve Bayes, using majority voting and WordNet similarity values. They used a word vector method for feature extraction. This study achieved an overall accuracy of 82%.
Mohammed and Omar (2020) used three machine learning-based classifiers (KNN, LR, and SVM) and two feature extraction techniques (TF-IDF and Word2Vec) to classify questions according to the cognitive domains of Bloom's taxonomy. They used two datasets containing 141 and 600 questions, respectively. The average accuracy was 83.7% for the first dataset and 89.7% for the second. The results of that study indicated that the presented techniques are significant in classifying questions from multiple domains based on Bloom's taxonomy. Osman and Yahya (2016) applied four ML-based classifiers (Naïve Bayes, SVM, LR, and decision tree) to categorize exam questions automatically according to the cognitive levels of Bloom's taxonomy. They classified exam questions using linguistically motivated features (bag-of-words, part-of-speech (POS), and n-grams). The authors used a dataset containing 600 exam questions for an English language course. The findings indicated that the ML techniques combined with linguistically motivated features achieved satisfactory performance in classifying exam questions automatically according to BCLs.

3 Proposed system

This paper develops an intelligent system based on image processing (IP) and ML techniques to evaluate exam papers against quality standards covering both form and content. The proposed system was implemented using the Python 3.7.6 programming language. It consists of two subsystems: the Exam Paper Formal Evaluation Sub-System and the Exam Questions Classification Sub-System.

3.1 Exam paper formal evaluation sub-system

Figure 1 shows the structure of the exam paper formal evaluation sub-system, which is mainly based on IP techniques. It uses several main libraries for execution (Image from PIL, Pytesseract, cv2, os, re). This sub-system is developed to verify that the exam paper meets the formal specifications and criteria (the 14 specifications shown in Table 1).

Fig. 1
figure 1

The structure of the exam paper formal evaluation sub-system

Table 1 Formal specification of the exam paper

3.1.1 First exam page identification and basic data verification module

The first page of the exam contains the basic data in the header and footer, as shown in Table 1. The proposed model identifies the first exam page and verifies whether it contains all the basic data. This goal can be achieved in several phases, as follows:

A) Image Pre-processing Phase:

In this phase, the exam pages are converted to an appropriate image format that can be processed, using one of the available websites (Footnote 1), while maintaining image quality and adapting the pages for processing. This phase consists of two main steps: image resizing and text ROI extraction.

  • Step 1: Image Resizing: After reading the exam images, they are resized to a fixed size of 1240 × 1754 pixels using the Image.resize() function in Python.

  • Step 2: Text ROI Extraction: The text region of interest containing the basic data of the exam (which always appears in the header of the first exam page) is extracted using the Image.crop() function with the coordinates (0, 0, width, 745), as shown in the sketch below.
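A minimal sketch of this pre-processing phase using the PIL library named above is shown below; the function name and file path are illustrative, not the system's actual code.

```python
# Minimal sketch of the pre-processing phase (Steps 1-2); the function name and
# image path are illustrative placeholders.
from PIL import Image

def preprocess_page(image_path):
    """Resize an exam-page image and crop the header ROI that holds the basic data."""
    page = Image.open(image_path)
    page = page.resize((1240, 1754))                   # Step 1: fixed page size
    header_roi = page.crop((0, 0, page.width, 745))    # Step 2: header text ROI
    return header_roi
```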

B) Text Recognition Phase:

This phase is based mainly on Pytesseract, an open-source text recognition engine used to process the image and recognize the text. In this phase, the extracted text ROI is translated into a text string using the pytesseract.image_to_string() function to obtain a list of the recognized texts.
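A corresponding sketch of the text recognition phase is given below; it assumes the Tesseract OCR engine is installed locally, since pytesseract is only a wrapper around it.

```python
# Minimal sketch of the text recognition phase; assumes Tesseract OCR is installed.
import pytesseract

def recognize_header_text(header_roi):
    """Translate the cropped header ROI into a list of recognized words."""
    text = pytesseract.image_to_string(header_roi)   # OCR on the PIL image
    return text.split()                              # list of recognized texts
```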

C) Comparison Phase:

This phase consists of two basic steps:

  • Step 1: Create the Bag of Words (BOW), which contains the basic data that should be available at the head of the first exam page, as indicated in the following Python representation:

  • bow = ['University', 'Faculty', 'Department', 'Semester', 'Academic year', 'Course name', 'Course code', 'Exam date', 'Exam time', 'Total degree', 'Number of pages'].

  • Step 2: Matching the BOW against the extracted list of recognized texts, which gives one of the following two cases:

    • Case 1: A match is found for at least one of these words, which indicates that this page is the first exam page. In this case, the system checks whether the page contains all the basic data required in the header, by matching all BOW words with the list of recognized texts. This gives one of two results:

  • All words match, which indicates that the first exam page is valid and contains all the basic data required in the header.

  • Some words do not match, which indicates that the first exam page is incorrect and does not contain all the basic data required in the header.

Case 2: No match is found for any of these words, which indicates that this page is not the first exam page. Accordingly, the other pages are read one after another and the above phases and steps are re-executed until the first exam page is found and validated. Figure 2 shows a part of the Python representation of the matching phase.

Fig. 2
figure 2

Python representation of the matching phase
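As a complement to Fig. 2, the following is a simplified sketch of the matching logic described in the two cases above; the helper function name is illustrative and not taken from the system's actual code.

```python
# Simplified sketch of the matching phase (complementing Fig. 2); the helper
# function name is an illustrative placeholder.
bow = ['University', 'Faculty', 'Department', 'Semester', 'Academic year',
       'Course name', 'Course code', 'Exam date', 'Exam time', 'Total degree',
       'Number of pages']

def check_first_page(recognized_words):
    """Return (is_first_page, missing_items) for the list of recognized header words."""
    text = ' '.join(recognized_words).lower()
    found = [item for item in bow if item.lower() in text]
    if not found:                                   # Case 2: not the first exam page
        return False, bow
    missing = [item for item in bow if item not in found]
    return True, missing                            # Case 1: first page; empty list means valid
```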

3.1.2 University logo validation module

This module is designed to validate the university logo appearing on the header of the first exam page. This is done through the following two phases:

A) First page Image Pre-processing Phase:

This phase is performed in two steps:

  • Step 1- Image Resizing: The image of the first exam page is read and then resized to a fixed size of 1000 × 1000 pixels using the cv2.resize() function. This size achieved the best results and was determined through several trials.

  • Step 2- Extracting the ROI "University Logo": the ROI in the image (the university logo) is extracted by cropping the coordinates [37:85, 69:150] at the head of the first exam page; this crop represents the target image.

B) Matching phase:

In this phase, the target image is matched with the university logo template image using the cv2.matchTemplate() function with the TM_CCOEFF_NORMED matching method. In this template matching method, the similarity between the target image and the template is calculated and the similarity result is returned. If the result is larger than 80%, the target image is considered the correct logo image; otherwise, the logo image is incorrect.
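The sketch below illustrates the two phases of this module with OpenCV; the template file name is an assumption, and the stored template is assumed to fit inside the cropped logo ROI.

```python
# Minimal sketch of the logo validation module; 'logo_template.png' is an
# illustrative file name, and the template is assumed to fit inside the cropped ROI.
import cv2

def validate_logo(first_page_path, template_path='logo_template.png', threshold=0.8):
    page = cv2.resize(cv2.imread(first_page_path), (1000, 1000))   # Step 1: resize
    target = page[37:85, 69:150]                                   # Step 2: logo ROI
    template = cv2.imread(template_path)
    result = cv2.matchTemplate(target, template, cv2.TM_CCOEFF_NORMED)
    return result.max() >= threshold          # correct logo if similarity > 80%
```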

3.1.3 Last exam page identification and basic data validation module

This module aims to identify the last exam page and verify whether it contains the basic data of the exam footer as shown in Table 1. This module is mainly based on the Pytesseract library. To achieve this aim, the following steps were implemented:

  • 1: Read the images of exam pages one after the other, except for the first page.

  • 2: Translate each image into a text string using the pytesseract.image_to_string() function and store it in a text file.

  • 3: Search inside this text file for the text strings "Best wishes" or "End of exam", in addition to the text string "Examiners' Panel", using the find() function. If any of these text strings is found, the page is identified as the last exam page and validated (see the sketch after these steps).
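A minimal sketch of these three steps is given below; the list of page image paths is assumed to be ordered and to exclude the first page.

```python
# Minimal sketch of the last-page identification steps; page paths are assumed
# ordered, with the first exam page already excluded.
import pytesseract
from PIL import Image

MARKERS = ['Best wishes', 'End of exam', "Examiners' Panel"]

def find_last_page(page_paths):
    for path in page_paths:                                     # Step 1: read pages
        text = pytesseract.image_to_string(Image.open(path))    # Step 2: OCR to text
        if any(text.find(marker) != -1 for marker in MARKERS):  # Step 3: search markers
            return path                                         # last exam page found
    return None
```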

3.1.4 Total grade validation module

In this module, the headings of the main questions on all exam pages and the grade of each question are located, to make sure that the sum of all question grades equals the total exam grade. This module is based mainly on the re library, which provides a set of functions for matching regular expressions. Several steps are executed in this module:

  • 1: Reading the images of the exam pages one by one and translating each into a text string using the pytesseract.image_to_string() function.

  • 2: Searching with the re.findall() function inside the extracted text string for any case matching the regular expression (Question [0-9]'+'[^]*\)') created using the re.compile() function; the findall() function then returns all matches as a list of strings as follows:

  • # [Question 1 (n mark), Question 2 (n mark), ...], where n is the question grade

  • 3: After that, each resulting text string is split using the split() function on the delimiter "(", yielding a list containing the headings of the main questions and another containing the grade of each question.

  • 4: Finally, the sum of all question grades is validated against the total exam grade (a simplified sketch follows).
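The following sketch illustrates these steps; the regular expression here is a simplified stand-in for the system's exact pattern, and the heading format "Question n (n mark)" is assumed.

```python
# Simplified sketch of the total-grade validation steps; the regular expression
# is an illustrative stand-in for the system's exact pattern.
import re

def validate_total_grade(pages_text, total_exam_grade):
    """pages_text: OCR text of each exam page; returns True if the grades sum correctly."""
    headings = []
    for text in pages_text:
        headings += re.findall(r'Question\s*\d+\s*\(\d+\s*mark', text)   # Step 2
    grades = [int(h.split('(')[1].split()[0]) for h in headings]         # Step 3
    return sum(grades) == total_exam_grade                               # Step 4
```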

3.2 Exam questions classification sub-system

This sub-system adopts an ML-based approach to classify exam questions into their corresponding BCLs. The sub-system comprises two major phases, training and testing, as shown in Fig. 3. In the training phase, three modules operate: question collection, question text pre-processing, and feature engineering. These modules are also involved in the testing phase, which additionally includes the classification module.

Fig. 3
figure 3

The structure of exam questions classification sub-system

3.2.1 Questions collection module

A set of questions was collected in the information technology area. Many questions were also collected from external sources such as textbooks and various websites to augment the training data. The total number of questions was about 900, all of which were short essay questions. Some questions contain interrogative terms, while others contain Bloom verbs. All questions were manually categorized into BCLs by pedagogical experts.

3.2.2 Questions text pre-processing module

Generally, the pre-processing applied to unstructured datasets is highly important in all systems that use an ML framework, in order to obtain acceptable performance and better quality in text classification (Mohammed & Omar, 2020; Raza et al., 2019). The semi-structured/unstructured raw question texts contain a large quantity of needless data that play no important role in the classification process and reduce its accuracy; such data also increase training time. Therefore, to improve the results of the proposed algorithm and increase classification accuracy, the following pre-processing steps were applied to the question text:

  • Step 1: Punctuation removal: First, all types of punctuation are removed from the question text, including . ; , : ? ! \ — [ ] / ( ) ' " … * (Ababneh, 2022).

  • Step 2: Tokenization: This is the process of breaking up a stream of textual content into words, symbols, terms, numbers, or other meaningful elements called tokens (Vijayarani & Janani, 2016). This operation has been applied to the question text based on whitespace characters, such as a space or line break, using the word_tokenize() method in the Natural Language Toolkit (NLTK). For example, consider the question text:

What is a computer network?

After the removal of the punctuation "?", it can be tokenized, with each token enclosed in single quotation marks as follows:

'What', 'is', 'a', 'computer', 'network'.

  • Step 3: Stop-word removal: In this study, stop words have been removed using a stop-word dictionary of the English language, which includes about 150 words such as ('the', 'a', 'an', 'this', 'that', etc.), prepositions ('at', 'by', 'for', 'from', 'of', etc.), pronouns ('he', 'she', 'it', 'them', etc.), and verb particles ('am', 'is', 'be', 'was', etc.).

  • Step 4: Case conversion: after stop words are removed, the tokens are converted to lowercase.

  • Step 5: Stemming: This is the process of removing affixes from words or transforming inflected words to their "root" forms, mapping a group of words to the same stem (Roy et al., 2020; Umer et al., 2021). In this manner, multiple different words can be normalized into a single morphological form, for instance converting both "computer" and "computing" into the stem "comput". This process is essential, has a positive effect on efficiency and performance, helps reduce feature complexity, and improves the learning capability of classifiers (Ababneh, 2022; Umer et al., 2021).

In this study, word stemming is performed to further refine the question text using the Porter stemmer algorithm in the NLTK toolkit, because it is one of the most popular English rule-based stemmers (Guia et al., 2019). So, in the current example, the text of the question is reduced to:

'what', 'comput', 'network'.

In this study, lemmatization (the WordNetLemmatizer algorithm) was also tested and compared with stemming (the Porter stemmer algorithm), and the comparison showed that stemming produced the better output.
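The following sketch pulls the five steps together using NLTK; the stop-word list here is NLTK's built-in English list with interrogative words kept, which approximates but is not identical to the 150-word dictionary described above.

```python
# Minimal sketch of Steps 1-5 using NLTK; the stop-word list approximates the
# 150-word dictionary described above (interrogative words are deliberately kept).
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

STOP_WORDS = set(stopwords.words('english')) - {'what', 'how', 'why'}
STEMMER = PorterStemmer()

def preprocess_question(question):
    question = question.translate(str.maketrans('', '', string.punctuation))  # Step 1
    tokens = word_tokenize(question)                                          # Step 2
    tokens = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]       # Steps 3-4
    return ' '.join(STEMMER.stem(t) for t in tokens)                          # Step 5

print(preprocess_question('What is a computer network?'))   # -> 'what comput network'
```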

3.2.3 Feature engineering module

Feature engineering is a mechanism for transforming raw text into numerical feature vectors that can be used by machine learning models (Godavarthi, 2021; Sabri et al., 2022). This module aims to convert question texts into appropriate and understandable feature vectors (vector representations) for the machine learning algorithm, which expects a two-dimensional matrix as input, in which the rows are concrete instances and the columns are attributes or features. In this study, the CountVectorizer() method provided by the scikit-learn library in Python was used to vectorize the questions and to extract feature vectors from the question text based on the frequency (count) of each word occurring in the entire question.

Assume there are three question text samples, as follows:

  • Questions = ['List two types of networks in terms of range.',

  • 'Determine the major benefits of computer networks',

  • 'What are Peer-to-Peer networks and Server-Based networks?'].

  • Questions after pre-processing = ['list two typ network term rang',

  • 'determine the major benefit of computer network',

  • 'what peer-peer network server bas network'].

The CountVectorizer method produces a matrix in which each unique word is represented by a column and each question sample by a row. The value of each cell is the count of that word in the particular question sample, as can be seen in Table 2.

Table 2 The result of the vectorization process
Table 3 A sparser matrix

It is worth noting that the above-mentioned words are not stored as strings but take specific index values. In this case, 'bas' has index 0, 'benefit' has index 1, 'comput' has index 2, and so on. The value 2 in (Q2, w7) indicates that word 7 appears twice in Q2, while the value 0 in (Q1, w1) indicates that word 1 does not appear in question Q1. This form of representation is known as a sparse matrix (see Table 3).
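A minimal sketch of this vectorization step with scikit-learn is shown below, applied to the three pre-processed example questions; note that CountVectorizer's default tokenizer may split or drop tokens slightly differently from the vocabulary listed above.

```python
# Minimal sketch of the feature-engineering module with scikit-learn;
# the default tokenizer may differ slightly from the example vocabulary above.
from sklearn.feature_extraction.text import CountVectorizer

questions = ['list two typ network term rang',
             'determine the major benefit of computer network',
             'what peer-peer network server bas network']

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(questions)       # sparse question-term matrix

print(vectorizer.get_feature_names_out())            # vocabulary (column order)
print(features.toarray())                            # word counts per question (rows)
```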

As a result, feature vectors for each question are obtained (See Fig. 4). Finally, the feature vectors are stored in the database.

Fig. 4
figure 4

The feature vectors of example questions

3.2.4 Question classification module: LR classifier

The supervised LR algorithm has been utilized in many different text classification applications and has achieved significant outcomes and good performance. The LR algorithm depends upon the logistic function (i.e., the sigmoid function) (Mohammed & Omar, 2020), which is an S-shaped curve that takes any real-valued number and maps it to a value between 0 and 1 (see Fig. 5) (Raza et al., 2019).

Fig. 5
figure 5

Standard logistic function

In this study, the LR algorithm is used to classify exam questions according to BCLs. Since this is a multi-class classification problem, the logistic regression algorithm applies the one-versus-rest approach. For example, to predict the question class, six binary classification problems are considered, i.e., whether the class is knowledge versus the rest, comprehension versus the rest, and so on. Maximum likelihood estimation is used to assign the predicted class, as implemented by Eq. (1) (Mohammed & Omar, 2020). The multi-class LR algorithm is also optimized by minimizing a loss (cost) function that measures the error between the predictions and the true labels using cross-entropy. The logistic regression algorithm was tested with different sets of parameter values to find the parameters that performed best in classifying the questions. The proposed algorithm achieved its best performance with the parameter values shown in Table 4:

Table 4 The parameters of a logistic regression model
$$p\left(c|x\right)=\frac{\exp\left({\sum }_{i=1}^{N}{w}_{i}{f}_{i}\left(c,x\right)\right)}{{\sum }_{\acute{c}\in C}\exp\left({\sum }_{i=1}^{N}{w}_{i}{f}_{i}\left(\acute{c},x\right)\right)}$$
(1)
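The sketch below illustrates this classification module with scikit-learn's LogisticRegression in one-versus-rest mode; the training questions and labels are tiny illustrative placeholders, and the hyperparameter values of Table 4 are not reproduced here, so common settings are used instead.

```python
# Minimal sketch of the question-classification module; questions, labels, and
# hyperparameters are illustrative placeholders (Table 4 values are not reproduced).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_questions = ['list two typ network term rang',          # e.g. Knowledge
                   'determin major benefit comput network',   # e.g. Comprehension
                   'compar wire wireless network']            # e.g. Analysis
train_labels = ['Knowledge', 'Comprehension', 'Analysis']

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_questions)

clf = LogisticRegression(multi_class='ovr', solver='liblinear')   # one-versus-rest
clf.fit(X_train, train_labels)

new_question = vectorizer.transform(['list type comput network'])
print(clf.predict(new_question))     # predicted Bloom cognitive level
```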

Finally, after the exam questions have been classified using the developed classification model, the proposed system calculates the percentage of each category represented in the exam paper using Eq. (2) below:

$$BCLs\;\left(\%\right)=\frac{\mathrm{Number\;of\;questions\;for\;a\;specific\;cognitive\;level}}{\mathrm{Total\;number\;of\;questions\;in\;the\;exam\;paper}}\times 100$$
(2)
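For illustration, a small sketch of Eq. (2) applied to a list of predicted labels is shown below; the labels are hypothetical.

```python
# Small sketch of Eq. (2); the predicted labels below are hypothetical.
from collections import Counter

predicted_levels = ['Knowledge', 'Knowledge', 'Comprehension', 'Application', 'Analysis']
counts = Counter(predicted_levels)
percentages = {level: 100 * n / len(predicted_levels) for level, n in counts.items()}
print(percentages)   # e.g. {'Knowledge': 40.0, 'Comprehension': 20.0, ...}
```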

4 Experimental methodology

This section discusses the experimental methodology applied in the current study to measure the performance of the developed system, as well as the results obtained and the related discussion. Two approaches have been applied:

4.1 Evaluation approach of the proposed classification model

Initially, the dataset used to test and evaluate the proposed LR-based classification model for classifying exam questions according to BCLs is described. Then, the metrics used for assessment are described, and finally, the results obtained are outlined and discussed.

4.1.1 Experimental dataset

The dataset collected in the current research consists of 900 questions in the English language related to the information technology area and has been classified by experts into six categories (Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation). After that, the data set for each category was divided into a training set (90% of the data set) to train the model, and a testing set (10% of the data set), to test the performance of the proposed model. Table 5 presents the statistics for the collected data set.

Table 5 Dataset statistics

4.1.2 Performance evaluation metrics

To evaluate the performance of the proposed classification model, tenfold cross-validation was applied to the training data, both to assess the accuracy of the predictions and to calibrate the model's hyperparameters. This means that the training process is run ten times, each time using a random selection of 90% of the training dataset for training and the remaining 10% for validation, as shown in Fig. 6. The effectiveness of the final model (accuracy, F-measure, and error rate) has then been assessed by applying it directly to the unseen testing data.

Fig. 6
figure 6

Training and testing processes of the proposed and other models
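A minimal sketch of this evaluation protocol with scikit-learn is given below; the synthetic feature matrix generated here is only a stand-in for the real vectorized question features.

```python
# Minimal sketch of the 90/10 split and tenfold cross-validation protocol;
# make_classification is only a stand-in for the real vectorized question features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=900, n_features=50, n_informative=20,
                           n_classes=6, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)       # 90% train / 10% test

clf = LogisticRegression(multi_class='ovr', solver='liblinear')
print(cross_val_score(clf, X_train, y_train, cv=10).mean())  # tenfold cross-validation

clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))       # accuracy on the unseen testing data
```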

These metrics can be calculated using Eqs. (3)–(7) (Ababneh, 2022; Diab & Sartawi, 2017; Occhipinti et al., 2022; Singh et al., 2022).

  • True positives (TP): the total number of questions correctly classified by the system under the category.

  • True negatives (TN): the total number of questions correctly not classified by the system under the category.

  • False positives (FP): the total number of questions the model incorrectly classifies under the category.

  • False negatives (FN): the total number of questions that belong to the category but which the model does not classify under it.

    $$Precision=\frac{TP}{TP+FP}$$
    (3)
    $$Recall=\frac{TP}{TP+FN}$$
    (4)
    $$F-measure=2*\frac{\left(Recall*Precision\right)}{\left(Recall+Precision\right)}$$
    (5)
    $$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
    (6)
    $$Error-Rate=\frac{FP+FN}{TP+TN+FP+FN}$$
    (7)

On the other hand, to estimate the overall effectiveness of a classification model, the macro-average and the micro-average can be measured for all the categories using the following equations (Al-Salemi et al., 2019; Chang et al., 2008; Qian et al., 2007; Yang et al., 2012; Zhu & Wong, 2017):

Macro-averaged Precision, Recall, and F-measure

$${P}_{macro}=\frac{{\sum }_{i=1}^{\left|C\right|}{p}_{i}}{\left|C\right|}, {R}_{macro}=\frac{{\sum }_{i=1}^{\left|C\right|}{r}_{i}}{\left|C\right|}, {F}_{macro}=\frac{{\sum }_{i=1}^{\left|C\right|}{F}_{i}}{\left|C\right|}$$
(8)

Micro-averaged Precision, Recall, and F-measure

$${P}_{micro}=\frac{{\sum }_{i=1}^{\left|C\right|}T{P}_{i}}{{\sum }_{i=1}^{\left|C\right|}\left(T{P}_{i}+F{P}_{i}\right)},$$
(9)
$${R}_{micro}=\frac{{\sum }_{i=1}^{\left|C\right|}T{P}_{i}}{{\sum }_{i=1}^{\left|C\right|}\left(T{P}_{i}+F{N}_{i}\right)},$$
(10)
$${F}_{micro}=\frac{2*({P}_{micro}*{R}_{micro})}{({P}_{micro}+{R}_{micro})}$$
(11)

where pi, ri, and Fi are the precision, recall, and F-measure for category ci, |C| is the total number of categories, and TPi, FPi, and FNi are the true positives, false positives, and false negatives for category ci (Zhu & Wong, 2017).

As both the training data and testing data are randomly chosen and often imbalanced, the weighted recall, precision, and F1-measure are also computed. The equations for the weighted precision, weighted recall, and weighted F1-measure are (Ababneh, 2022; Mohammed & Omar, 2020):

$$Weighted\;Recall=\frac{{\sum }_{i=1}^{n}(recal{l}_{i}*suppor{t}_{i})}{{\sum }_{i=1}^{n}suppor{t}_{i}}$$
(12)
$$Weighted\;Precision=\frac{{\sum }_{i=1}^{n}(precisio{n}_{i}*suppor{t}_{i})}{{\sum }_{i=1}^{n}suppor{t}_{i}}$$
(13)
$$Weighted\;F1-measure=\frac{{\sum }_{i=1}^{n}(f1-measur{e}_{i}*suppor{t}_{i})}{{\sum }_{i=1}^{n}suppor{t}_{i}}$$
(14)

where support is the actual number of questions in each class (taken from the test dataset).
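All of the metrics in Eqs. (3)–(14) can be obtained with scikit-learn, as sketched below; the fitted clf, X_test, and y_test are assumed to come from the evaluation sketch above.

```python
# Minimal sketch of the metric computations in Eqs. (3)-(14); clf, X_test, and
# y_test are assumed to come from the evaluation sketch above.
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred))              # Eq. (6); error rate = 1 - accuracy
for avg in ('macro', 'micro', 'weighted'):         # Eqs. (8)-(14)
    p, r, f1, _ = precision_recall_fscore_support(y_test, y_pred, average=avg)
    print(avg, p, r, f1)

print(classification_report(y_test, y_pred))       # per-class precision/recall/F1, support
```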

4.1.3 Experimental results

This section addresses the experimental findings obtained using the developed classification model, which are summarized in Tables 6, 7, and 8 and Figs. 7, 8, and 9.

Table 6 Overall prediction results for each BCL’s class in the testing set
Table 7 The values of Micro (Macro)-average overall BCLs classes
Table 8 The values of (weighted average precision, weighted average recall, and weighted f-measure) overall BCLs classes in the testing set
Fig. 7
figure 7

Overall evaluation results for each BCLs class in the testing set

Fig. 8
figure 8

Results of Micro (Macro)-average overall BCLs classes in the testing set

Fig. 9
figure 9

Results of weighted average over all BCLs classes in the testing set

Overall, the proposed classification model achieves an average accuracy of 98.5% in classifying exam questions across all BCL classes. As shown in Table 6 and Fig. 7, the proposed model achieves its best results and highest performance at the synthesis and evaluation levels, both of which reach 100% accuracy. The overall accuracy for knowledge, comprehension, application, analysis, synthesis, and evaluation was 98.5%, 97%, 98%, 100% and 100%.

The model obtained averages of 97%, 96%, and 96% for precision, recall, and F-measure, respectively, with a low error rate (0.015). Additionally, as illustrated in Table 7 and Fig. 8, the micro-averages of precision, recall, and F-measure were all 96%, while the macro-averages were 97%, 96%, and 96%, respectively. Furthermore, Table 8 and Fig. 9 make clear that the proposed model achieved good performance, obtaining the same value for the weighted averages of precision, recall, and F-measure, which reached 95.9%. Based on all the results shown in the above tables and figures, it can be concluded that the proposed LR-based classification model is effective and achieves good results in classifying exam paper questions according to BCLs on the collected dataset.

4.2 Exam paper evaluation approach: Manually (experts) vs. automatically (proposed system)

To verify the effectiveness and accuracy of the proposed system in automating the evaluation of the content and formal specifications of exam papers together, a sample of five exam papers was evaluated both automatically by the proposed system and manually by several relevant experts and specialists. Table 9 presents the results of the automatic evaluation versus the manual evaluation of the content specifications of the five tested exam papers. The proposed system displays the evaluation result for the exam paper, together with the percentage of each cognitive category represented in it.

Table 9 Results of automatic evaluation vs manual evaluation of the content specifications of the exam paper

Table 9 above clearly indicates the extent of agreement and relative convergence between the evaluation results of the proposed system and those of the experts and specialists in evaluating the content specifications of the exam paper. Thus, the proposed model can be applied to classify exam questions according to BCLs with high accuracy, determining the relative weight of each category represented in the exam paper. Based on this, academics can modify their exam questions according to Bloom's standard percentages to obtain a balanced, high-quality exam. Table 10 compares the evaluation results of the proposed system against the experts' evaluation of the five tested exam papers in terms of formal specifications.

Table 10 Results of automatic evaluation vs manual evaluation of the formal specifications of the exam paper

It is clear from Table 10 above that there is significant agreement between the proposed system and the experts in evaluating the formal specifications of the exam paper. According to all of these findings, the study concludes that the proposed system can be applied to evaluate exam papers in light of the standards of a good exam paper regarding both form and content.

4.3 Comparison of the proposed classification model with the related works

An analysis was conducted to compare the performance of the proposed classification model with other models reported in the literature for classifying exam questions according to BCLs. Table 11 summarizes the results of this comparison in terms of accuracy.

Table 11 Comparison between the proposed model and works in the literature

From Table 11, it is clear that the proposed method outperforms the related work, which reported classification accuracies of 73.3% (Sangodiah et al., 2022), 89.7% (Mohammed & Omar, 2020), 82% (Osadi et al., 2017), and 70% (Jayakodi et al., 2016), whereas the proposed model achieves a higher accuracy of 98.5%.

5 Discussion

This section integrates our research outcomes with prior studies, explores the practical ramifications and constraints of our work, and identifies avenues for future inquiry, offering a thorough examination that highlights the importance of our contributions within the field of educational technology.

Our research underscores the transformative capability of automated evaluation systems in enhancing higher education assessments. Through a comparative analysis of our system's performance against conventional manual grading techniques, we noted considerable improvements in both grading efficiency and uniformity.

Our findings reveal the effectiveness of a classification-driven intelligent system in revolutionizing the assessment of higher education examination papers' quality. The results affirm our success in automating the evaluation process, achieving high efficiency and accuracy in assessing both the form and content of examination papers.

When juxtaposed with existing scholarly works, our outcomes are consistent with the findings of Mohammed and Omar (2020) and Osman and Yahya (2016), who acknowledged the proficiency of certain machine learning algorithms in attaining notable classification accuracy, with the logistic regression (LR) classifier emerging as the most effective. Contrary to the experiences of Kumara et al. (2019), Jayakodi et al. (2016), and Haris and Omar (2012), who encountered obstacles in question classification using a rule-based approach, our machine learning-based model exhibited enhanced performance in this domain, benefiting from sophisticated natural language processing technologies.

Additionally, the contributions of Kumara et al. (2019) and Sangodiah et al. (2022) regarding the use of AI in automated multiple-choice question evaluation have been acknowledged for their role in reducing administrative tasks and boosting grading efficiency. Our research broadens this scope to include not only multiple-choice but also intricate, open-ended questions, thereby expanding the utility and innovation of automated assessment technologies. In summary, our proposed classification model outperforms previous iterations by 9%, as evidenced in Table 11, marking a significant advancement in the field.

Our findings complement theirs by implementing context-aware algorithms that improve accuracy in evaluations, demonstrating our system's enhanced capability to handle the nuances of different disciplines.

These comparisons illustrate how our research builds upon and diverges from existing studies, highlighting our contributions to the field of educational technology and the ongoing development of automated evaluation systems.

  • Educational Implications: The implications of our findings extend beyond mere operational efficiencies. By automating routine evaluation tasks, educators can redirect their focus towards more nuanced and qualitative aspects of teaching and learning, such as personalized feedback and intervention strategies. This shift could profoundly impact educational outcomes, fostering a more engaging and supportive learning environment.

  • Limitations: Despite promising results, our study acknowledges limitations, including the system's performance variability across different subject matters and question types. The reliance on a predefined dataset for algorithm training also raises questions about the system's adaptability to new or evolving educational content.

6 Conclusion and future work

The current study has developed a novel intelligent system based on image processing and machine learning techniques to evaluate exam papers. This system assesses exam paper quality from two main aspects: formal specifications, ensuring each paper fulfills its basic data requirements, and content specifications, verifying that exam questions adequately cover different cognitive objectives according to Bloom's standard percentages. The implementation of this intelligent system in the Measurement and Evaluation Unit marks a significant stride towards automating the evaluation of university examination papers. This automation aligns with the NAQAAE framework and supports academics in creating balanced and high-quality exams that align with intended learning outcomes and meet quality standards. The classification model at the heart of this system has demonstrated a high accuracy rate of 98.5%, outperforming several related models and showing remarkable alignment with expert evaluations in terms of form and content.

  • Future Directions: Our study's findings and the system's limitations have illuminated several paths for future research. Expanding the dataset to include a broader range of disciplines and question formats could greatly enhance the system's versatility and applicability. Additionally, investigating adaptive learning algorithms may address the evolving nature of content over time, offering a dynamic approach to exam paper evaluation. Further integration of this system within a comprehensive educational technology ecosystem could provide valuable insights into its impact on student engagement and learning outcomes. We are committed to refining our model with larger datasets and advanced techniques such as deep learning to explore these future directions fully.