Online exams and the COVID-19 pandemic: a hybrid modified FMEA, QFD, and k-means approach to enhance fairness

The COVID-19 pandemic caused an increasing demand for online academic classes, which in turn created a demand for effective online exams under limited time and resources. Consequently, holding online exams with sufficient reliability and effectiveness became one of the most critical and challenging subjects in higher education. It is therefore essential to have a preventive algorithm that allocates time and financial resources effectively. In the present study, a fair test with sufficient validity is first defined, and then, by analogy with an engineering product, the design process is applied to it. For this purpose, a hybrid method based on FMEA, a preventive method that identifies potential failure modes and prioritizes their risk, is employed. The method's output is provided to the QFD algorithm as the needs of the product's customers. Then, the proposed solutions to prevent failures are weighted and prioritized as the product's technical features. Some modifications are made to the classic form of FMEA in the proposed method to eliminate its deficiencies and contradictions. Our proposed algorithm is therefore a precautionary approach that works to prevent failures instead of fixing them after they occur, which is particularly effective in increasing the efficiency of activities in times of crisis. Eventually, a prioritized list of preventive actions is provided, allowing us to choose among the available solutions when time and budget are limited and not all possible actions can be taken.


Introduction
The COVID-19 pandemic, also known as the Coronavirus pandemic, is an ongoing global crisis that has caused significant changes in academia, demanding new regulations and creating unprecedented challenges for both learners and tutors [1]. To minimize the transmission of the contagious virus, students have to study from home. Education systems need to provide online strategies for teaching, learning, and evaluation to help with this transition. Besides the current demand for online education caused by the pandemic, some of the new practices imposed by the current situation can be maintained and used even after the crisis [2]. Investigating and analyzing how the pandemic affects academic activities helps us overcome current challenges. We can use this experience to enhance our academic measures and advance online education capabilities [3].
With the outbreak of Coronavirus (COVID-19) disease, online exams became common practice for academic evaluation. Online exams offer several desirable advantages, such as time efficiency [4], ease of use [5], enhanced adaptability [6,7], and provision of immediate feedback [8]. On the other hand, limited computer and internet accessibility [9], lack of experience with computers or online assessment processes [10], test anxiety [11], and higher cheating rates [12,13] are some of the main challenges that come with online exams.
Given the critical pandemic situation, online exams are inevitable, and their use will increase even in non-critical situations. To hold them more fairly, possible failures should be identified in advance so that they can be mitigated or eliminated as a precaution. The basic questions, or in other words the objectives of our research, are therefore as follows:
• What is the definition of a fair exam?
• Who are the customers of an online test process, and what are their needs?
• What is the priority and importance of each of these needs for them?
• What characteristics of the process can be effective in meeting these needs, and to what extent?
The ultimate goal is to provide a list of things that we can do to have a fairer online exam.
Fairness is often regarded as the most important pillar of examinations, which strongly affects students [14,15]. Exam fairness preserves academic integrity and improves the students' motivation to enhance their performance [16,17]. There are numerous challenges to fairness in online exams, such as limited proctoring options and higher cheating rates [18].
The current circumstances and the necessity of employing online exams while eliminating their shortcomings call for an effective algorithm. Failure mode and effects analysis (FMEA) can be a robust tool for this purpose. FMEA is a widely used technique to diagnose and prevent product, system, and operation failure modes before they occur [19]. As Lolli et al. mentioned in their work in 2016, FMEA is primarily performed by providing a list of potential failure modes, assigning numbers associated with the severity, detection, and probability of occurrence to each of these events, and eventually obtaining the risk priority number (RPN) by multiplying these numbers. The performance of FMEA relies entirely on proper determination of the severity, detection, and occurrence numbers, and thus of the RPN values. For the severity and detection numbers, which are essentially subjective values, this is less of a challenge than for the occurrence number, which has an objective nature [20].
The K-means clustering method is one of the simplest yet most commonly used unsupervised learning algorithms. It can help us prevent conflicting situations, especially in the assignment of occurrence probability numbers [20].
In 2002, Berget and Naes [21] introduced a fuzzy K-means-based clustering algorithm for sorting raw materials to improve the quality of the final product, which works similarly to an optimization problem. In 2004, Sarkar [22] proposed a clustering algorithm for failure modes to investigate the probability of each state occurring. In 2014, Lolli et al. [23] presented an application of K-Means to multi-criteria classification, the key ideas of which can serve as the basis for an algorithm with our intended purpose. In a similar work in 2016, Lolli et al. [20] used K-Means to resolve inconsistencies in the "occurrence" parameter, which is an objective parameter of FMEA.
In the last step of the FMEA method, a list of preventive and corrective actions is presented to reduce the occurrence, minimize the effects, or enhance the probability of detecting improper conditions [24]. However, time and cost limitations make it impracticable to use all offered solutions to eliminate every unfavorable situation. Therefore, we need to rank and prioritize the recommended corrective actions.
Quality function deployment (QFD) is an effective and robust means commonly used to design engineering products aiming to reach maximum customer satisfaction. In this method, customer needs are associated with the product's technical characteristics in the QFD matrix. Eventually, the QFD process results in a ranked and weighted list of technical product features [25,26]. In an analogy, failure modes are regarded as customer needs, the RPN as the priority of needs, and the listed preventive and corrective actions as the product features. These entities are supplied to the QFD. The final goal of this algorithm is to present a weighted list of corrective and preventive actions as the output.
In crises such as the COVID-19 pandemic, the demand for online testing increases while time and resource limitations become more severe, so a preventive algorithm for the effective allocation of financial and time resources is essential. The main innovation of the proposed algorithm is to treat the online test process as an engineering product and then to use the FMEA, K-Means, and QFD tools together to design it. The most important advantages of such an algorithm are as follows. First, the proposed method identifies all groups that are internal or external customers of the process and is based on a survey of all of these customers, which makes it a comprehensive approach. Second, one of its foundations is FMEA, which is inherently preventive. Our algorithm therefore deals with preventing faults instead of repairing them after they occur, which is very effective in increasing the efficiency of activities in times of crisis. Third, FMEA has contradictions that are largely resolved in the proposed algorithm using K-Means. Finally, employing QFD, a tool based on maximum customer satisfaction, is very efficient in resource allocation. Time and financial resources, which are especially limited in times of crisis, are thus spent on activities that ultimately lead to greater customer satisfaction.
The proposed algorithm has been applied to mechanical engineering students at the Sharif University of Technology for two consecutive semesters. This paper aims to improve exam fairness by analyzing the worries and challenges that these students experienced during their online exams in times of the COVID-19 pandemic. The results are presented and investigated in this paper.

Materials and methods
We aim to provide an algorithm that can be deployed to identify existing and potential defects of a fair online exam and then to find and prioritize possible solutions. The prioritization is necessary since time and cost limitations make it impossible to implement all possible solutions. We can therefore apply only the most effective solutions and disregard the less effective ones.
For this, it is necessary first to define a fair exam, and then, according to its characteristics, to identify potential failure modes and their effects. Solutions to eliminate or reduce the effects should then be provided and prioritized.

Definition of a fair exam
In an online survey, we asked college students and professors to provide their definitions of a fair exam. Additionally, they were asked to list potential problems they had encountered, describe their effects, and suggest solutions for greater fairness. Twelve university professors and 118 students participated in the survey. To obtain a relatively homogeneous statistical population that covers a broad spectrum, the group of professors comprised three people in mathematics and engineering, three in medicine, three in humanities, and three in art. The three people in each category were a highly experienced professor (more than 20 years of experience), a moderately experienced professor (between 10 and 20 years of teaching), and a young professor (less than 10 years of experience). Also, from each of the disciplines mentioned in the professors' group, 30 students were selected: 10 students with a GPA of A, 10 with a GPA of B, and 10 with a GPA of C. In the art group, the surveys of two students with a GPA of C were invalid, resulting in a total of 118 students. Summarizing the commonalities and rewriting the participants' views led to the following definition: A fair assessment occurs when participants' knowledge of the presented topics is measured appropriately, they have equal conditions, and they are fully justified with the outcome [15]. Moving toward these conditions leads to a fairer exam.

Basic FMEA
FMEA is a powerful engineering tool for identifying potential failure modes and their sources. It works by thinking about a product, process, or service in reverse [27]. In this method, the effect of each failure mode on the customer is represented by the severity number (S). Likewise, the likelihood of detecting a failure when it occurs is represented by the detection number (D), and the probability of its occurrence by the occurrence number (O). These three numbers lie within the range of 1 to 10. Higher severity and higher probability of occurrence lead to larger S and O numbers. The D number becomes larger when preventive detection of the failure mode is unlikely. The risk priority number (RPN) is:
$$\mathrm{RPN} = S \times D \times O \qquad (1)$$
where RPN ranges between 1 and 1000, and higher numbers indicate a higher risk of the failure mode [28][29][30][31]. The scales used to determine the S, O, and D values are provided in
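As a minimal sketch of the risk priority calculation described above, the following Python snippet multiplies the three scores and ranks two failure modes. The function name `risk_priority_number`, the labels `F_a` and `F_b`, and all scores are hypothetical and chosen only for illustration; they are not survey data.

```python
def risk_priority_number(severity, detection, occurrence):
    """Product of the three FMEA scores, each expected to lie in [1, 10]."""
    scores = {"severity": severity, "detection": detection, "occurrence": occurrence}
    for name, value in scores.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must lie in [1, 10], got {value}")
    return severity * detection * occurrence


# Hypothetical failure modes as (S, D, O) triples; illustrative numbers only.
modes = {"F_a": (8, 3, 2), "F_b": (4, 6, 5)}
rpn = {name: risk_priority_number(*scores) for name, scores in modes.items()}

# Rank from highest to lowest RPN.
ranked = sorted(rpn, key=rpn.get, reverse=True)
print(rpn, ranked)
```

In this toy example the mode with severity 8 receives the lower RPN, which is the kind of situation that motivates treating high severity separately from the plain product.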

Modifying basic FMEA using K-means clustering
Proper determination of RPN relies on the correct assignment of S, D, and O values. By their nature, S and D are subjective, whereas O is objective. In particular, the magnitude of O depends on the occurrence records of a failure mode. Suppose that O values lie within the range of 1-10, so that occurrence probabilities are divided into ten distinct classes. If a type of failure occurs up to 2000 times a year, the width of each class is then 200. This is shown in Fig. 1. Now, assume that one failure mode occurs 596 times and another occurs 604 times. The first failure mode falls in the third class, while the second falls in the fourth class, even though it happened only eight times more often than the first. This paradox casts doubt on the accuracy of occurrence number assignments.
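The boundary effect in this example can be reproduced with a short Python sketch. The helper `equal_width_class` is a hypothetical name, and it assumes that class l covers the counts from 200(l - 1) + 1 to 200l, which matches the classes stated in the text.

```python
import math


def equal_width_class(count, max_count=2000, n_classes=10):
    """1-based class index when (0, max_count] is split into equal-width classes."""
    width = max_count / n_classes
    return min(n_classes, max(1, math.ceil(count / width)))


# Two nearly identical counts land in different classes...
print(equal_width_class(596), equal_width_class(604))  # 3 4
# ...while two counts almost 200 apart share one class.
print(equal_width_class(401), equal_width_class(596))  # 3 3
```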
We use k-means, an intelligent, nonlinear clustering method, to resolve this issue. In this algorithm, k cluster centers are first selected at random, where k is specified by the user. Next, the Euclidean distance between each point and the cluster centers is measured, and each point is assigned to the cluster with the nearest center. When all points have been allocated, new centers are calculated by averaging the members of each cluster. This process continues until the predetermined stopping condition is fulfilled [34]. The k-means clustering method is an unsupervised learning algorithm [35,36]. To assign the occurrence number, the range is divided into ten classes. Then, the midpoint of each class, along with the other data points, is given to k-means as input. Since k-means does not leave any cluster empty, this process reduces the risk of placing two points with close numbers of occurrences in two separate clusters. Similarly, it is unlikely for two distant values to end up in two consecutive classes. Consequently, the paradox is resolved [37].
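The assign-and-update loop described above can be sketched in a few lines of Python. This is a simplified one-dimensional illustration, not the authors' MATLAB setup: it uses k = 3 with hand-picked initial centers instead of random ones (so that the run is reproducible) and omits the class midpoints. The function name `kmeans_1d` and the data are hypothetical.

```python
def kmeans_1d(points, initial_centers, max_iter=100):
    """Lloyd's algorithm on scalars. Returns (labels, centers)."""
    centers = list(initial_centers)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest center by absolute (1-D Euclidean) distance.
        labels = [
            min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members;
        # a center with no members keeps its previous position.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:  # stopping condition: centers no longer move
            break
        centers = new_centers
    return labels, centers


# Hypothetical yearly occurrence counts, including the 596 / 604 pair.
counts = [100, 120, 596, 604, 1800, 1850]
labels, centers = kmeans_1d(counts, initial_centers=[100, 604, 1850])
print(labels)   # [0, 0, 1, 1, 2, 2]
print(centers)  # [110.0, 600.0, 1825.0]
```

Here 596 and 604 share a cluster because the cluster boundaries follow the data instead of a fixed grid. In general the outcome of k-means depends on the initial centers, so this sketch illustrates the mechanism and does not guarantee the same behavior for every data set.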

Modifying risk priority numbers using fuzzy logic
High severity, regardless of RPN, means high risk [38], because even if the probability of occurrence is low or the possibility of preventive detection is high, the failure can lead to adverse effects on the process customers. Therefore, the risky situations are the failure modes with a high RPN together with the high-severity cases. These two factors can be combined in different ways, and the choice depends entirely on the nature of the factors and on the way humans infer. In such conditions, the closest tool to human inference is a fuzzy logic-based system [39]. The most common types of fuzzy systems are pure fuzzy, Takagi-Sugeno-based, and Mamdani-based systems [33,40]. For human inference, which requires expert knowledge expressed in linguistic variables, their fuzzification, inference, and then defuzzification, the most appropriate option is a fuzzy system based on the Mamdani algorithm [15].
To achieve this goal, a fuzzy inference system has been formed, with two inputs and one output. The shape of the membership functions of the inputs and output, which are of type Trimf (Triangular-shaped membership function), is as shown in Fig. 2.
Also, the fuzzy rules of the inference system are as follows:

1. If (Severity is Low) and (RRPN is Low) then (MRPN is Low).
2. If (Severity is Low) and (RRPN is Moderate) then (MRPN is Low).
4. If (Severity is Moderate) and (RRPN is Low) then (MRPN is Low).
5. If (Severity is Moderate) and (RRPN is Moderate) then (MRPN is Moderate).
6. If (Severity is Moderate) and (RRPN is High) then (MRPN is High).
7. If (Severity is High) then (MRPN is High).
The result of the fuzzy rules, that is, the relationship between the inputs and the output, is given by the surface drawn in Fig. 3.
Therefore, if we call this fuzzy system "Risk", we can write:
$$\mathrm{MRPN} = \mathrm{Risk}(S, \mathrm{RRPN}) \qquad (2)$$
where MRPN is the modified value of RPN, which treats high severity values as risky, and MRPN lies in the range of 0 to 100.
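To make the Mamdani-style inference concrete, the following pure-Python sketch implements a two-input, one-output system with triangular membership functions, min for "and", max for aggregation, and centroid defuzzification. Several details are assumptions rather than the authors' settings: the membership breakpoints (the real ones are defined in Fig. 2), the scaling of RRPN to the range 0-100, and the consequent of the rule for low severity with high RRPN, which is not listed in the text above and is set to Moderate here purely for illustration. The names `trimf`, `RULES`, and `risk` are hypothetical.

```python
def trimf(x, a, b, c):
    """Triangular membership: feet at a and c, peak at b (a == b or b == c gives a shoulder)."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


# ASSUMED breakpoints; the membership functions of the paper are those of Fig. 2.
SEVERITY = {"Low": (1, 1, 5.5), "Moderate": (1, 5.5, 10), "High": (5.5, 10, 10)}
RRPN = {"Low": (0, 0, 50), "Moderate": (0, 50, 100), "High": (50, 100, 100)}
MRPN = {"Low": (0, 0, 50), "Moderate": (0, 50, 100), "High": (50, 100, 100)}

# (severity term, RRPN term or None, MRPN term)
RULES = [
    ("Low", "Low", "Low"),
    ("Low", "Moderate", "Low"),
    ("Low", "High", "Moderate"),  # ASSUMED consequent, not taken from the text
    ("Moderate", "Low", "Low"),
    ("Moderate", "Moderate", "Moderate"),
    ("Moderate", "High", "High"),
    ("High", None, "High"),
]


def risk(severity, rrpn, steps=1000):
    """Mamdani inference: min for AND, max aggregation, centroid defuzzification."""
    strength = {term: 0.0 for term in MRPN}
    for sev_term, rrpn_term, out_term in RULES:
        w = trimf(severity, *SEVERITY[sev_term])
        if rrpn_term is not None:
            w = min(w, trimf(rrpn, *RRPN[rrpn_term]))
        strength[out_term] = max(strength[out_term], w)
    num = den = 0.0
    for i in range(steps + 1):
        y = 100.0 * i / steps
        # Clip each output set at its firing strength, then take the envelope.
        mu = max(min(strength[t], trimf(y, *MRPN[t])) for t in MRPN)
        num += y * mu
        den += mu
    return num / den if den else 0.0


print(round(risk(10, 0)))    # roughly 83: high severity dominates a low RRPN
print(round(risk(5.5, 50)))  # roughly 50
print(round(risk(1, 0)))     # roughly 17
```

Under these assumed settings, a failure mode with maximal severity is rated high even when its RRPN is minimal, which is the behavior the last rule is meant to enforce.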

Prioritizing actions using QFD
After determining the RPN value, possible preventive and corrective actions are determined for each failure mode. Quality function deployment (QFD) is used to determine the priority of each proposed solution. QFD is a customer-oriented method for designing new engineering products that aims to maximize customer satisfaction [26,27,41]. The main idea of QFD is to provide a list of prioritized customer needs related to the product. Then, the technical characteristics of the product are specified. The QFD matrix, shown in (3), is the mapping of needs to technical characteristics of the product [25,42,43]:
$$\begin{array}{c|ccc} & A_1 & A_2 & \cdots \\ \hline R_1 & W_{11} & W_{12} & \cdots \\ R_2 & W_{21} & W_{22} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{array} \qquad (3)$$
where $W_{ij}$ shows how much the jth technical characteristic meets the ith need, $A_j$ is a technical characteristic, and $R_i$ is the priority number of the ith need. Now, the weight of each technical feature is calculated by Eq. (4):
$$W_j = \sum_{i} R_i \, W_{ij} \qquad (4)$$
Equation (5) shows the normalized weight.
Fig. 2 Shape of the membership functions of (a) input variable "Severity", (b) input variable "RRPN", and (c) output variable "MRPN" for fuzzy system "Risk"
https://doi.org/10.1007/s42452-021-04805-z
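The weighting of technical characteristics in a QFD matrix can be sketched in Python as follows. The sketch assumes that the weight of a characteristic is the priority-weighted sum of its column, as in common QFD practice, and that normalization divides by the sum of all weights; the exact normalization used by the authors is not reproduced here. The function name `qfd_weights`, the 0/1/3/9 relationship scale, and all numbers are hypothetical.

```python
def qfd_weights(priorities, relationship):
    """priorities[i] is R_i; relationship[i][j] is W_ij.

    Returns (raw, normalized) weights per technical characteristic.
    Normalizing by the sum of the raw weights is an ASSUMPTION of this sketch.
    """
    n_features = len(relationship[0])
    raw = [
        sum(r * row[j] for r, row in zip(priorities, relationship))
        for j in range(n_features)
    ]
    total = sum(raw)
    normalized = [w / total for w in raw] if total else [0.0] * n_features
    return raw, normalized


# Hypothetical example: three needs (rows) and three characteristics (columns).
priorities = [5, 3, 1]
relationship = [
    [9, 1, 0],
    [3, 9, 1],
    [0, 3, 9],
]
raw, normalized = qfd_weights(priorities, relationship)

# Characteristics ordered from most to least important.
ranking = sorted(range(len(raw)), key=lambda j: raw[j], reverse=True)
print(raw, ranking)  # [54, 35, 12] [0, 1, 2]
```

The same computation applies when the needs are failure modes and the characteristics are preventive or corrective actions.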

The proposed algorithm
By analogy with the design of an engineering product, the steps of the proposed algorithm are as follows:
Step 1 Identifying the customers: Customers of an online exam process fall into two categories: "professors and assistants" as group A and "students" as group B. By analogy with engineering products, process customers can be classified into "manufacturers and service providers" (internal customers) and "consumers" (external customers). Here, professors and assistants act as manufacturers and service providers, and students as consumers. Their opinions about possible failure modes are considered the voice of the customer (VOC), or customer complaints. Let the number of people in group A be $N_A$ and the number of people in group B be $N_B$.
Step 2 Exploration of potential failure modes: Using a survey of groups A and B, all possible failure modes are identified. Each failure mode is denoted $F_j$. Let the total number of failure modes be m. Therefore:
$$F = \{F_1, F_2, \ldots, F_m\} \qquad (6)$$
where F is the set of failure modes.
Step 3 Determine severity numbers (S): For each $F_j$, determine the values $S_j^A$ and $S_j^B$, which are the average severity assigned to that failure mode by the individuals in groups A and B, respectively. Then calculate the value of $S_j$ according to Eq. (7):
$$S_j = \frac{S_j^A + S_j^B}{2} \qquad (7)$$
Step 4 Determine detection numbers (D): For each $F_j$, determine $D_j$, the average of the detection numbers assigned to that failure mode by the individuals in group A (for detection, only group A is polled).
Step 5 Identify the repetition of each failure mode: $q_j^A$ is the number of times the failure mode $F_j$ is repeated in group A, and $q_j^B$ is the same value in group B. The number of repetitions of failure mode $F_j$, denoted $q_j$, is calculated from Eq. (8).
Step 6 Calculate the central points of the occurrence intervals: The maximum and minimum values of $q_j$ obtained in step 5 are called $q_{max}$ and $q_{min}$, respectively. The center of each occurrence interval is then calculated from Eq. (9), where $q'_l$ is the center of the lth interval and l is an integer from 1 to 10.
Step 7 Calculating occurrence values (O) using k-means: Define the set Q as in Eq. (10):
$$Q = \{q_j\} \cup \{q'_l\} \qquad (10)$$
Then, using MATLAB R2013, $O_j$ = k-means(Q, 10) is computed, where $O_j$ is the number of the cluster containing $q_j$ and gives the occurrence value.
Step 10 Extract the possible solutions for each $F_j$: This is done using a survey of people in both groups A and B. Similar and closely related solutions are merged conceptually. R denotes the total number of solutions (preventive and corrective actions), which we present as the set C:
$$C = \{C_1, C_2, \ldots, C_R\} \qquad (11)$$
where $C_r$ is the rth solution.
Step 11 Forming a QFD matrix: By analogy with engineering products, failure modes of online academic exams are given as the customer needs. Here, the priority number of each customer need is the MRPN of the corresponding failure mode ($\mathrm{MRPN}_j$), and the suggested solutions are the product's technical characteristics ($C_r$). To fill the matrix, we take the average of the values from groups A and B, as in Eq. (12), where $W_{jr}$ is the effect of the solution $C_r$ on the failure mode $F_j$. According to (3), the weight of each solution ($W_r$) is given by (13):
$$W_r = \sum_{j=1}^{m} \mathrm{MRPN}_j \, W_{jr} \qquad (13)$$
and the normalized weight of each solution ($W_r^N$) is given by Eq. (14).
Step 12 Prepare a list of preventive and corrective actions along with their priorities: A prioritized list containing the set $C = \{C_r\}$ is presented as the result of the algorithm. $W_r^N$ is the weight of the solution, which also indicates its priority.
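Some of the numerical steps above can be illustrated with a short Python sketch. It covers the severity averaging of step 3 and the preparation of the clustering input in steps 6 and 7. The interval centers are computed under the assumption that the range from $q_{min}$ to $q_{max}$ is split into ten equal-width intervals whose midpoints are used; this form is an assumption of the sketch. The helper names, the severity votes, and the repetition counts are hypothetical; only $q_{min} = 9$ and $q_{max} = 61$ are values reported in the Results section.

```python
def mean(values):
    return sum(values) / len(values)


def combined_severity(scores_a, scores_b):
    """Average of the two group means: groups count equally, individuals do not."""
    return (mean(scores_a) + mean(scores_b)) / 2


def interval_centers(q_min, q_max, n_classes=10):
    """Midpoints of n_classes equal-width intervals spanning [q_min, q_max] (ASSUMED form)."""
    width = (q_max - q_min) / n_classes
    return [q_min + (idx - 0.5) * width for idx in range(1, n_classes + 1)]


# Step 3 with hypothetical votes: 2 people in group A, 6 people in group B.
votes_a = [9, 7]
votes_b = [4, 4, 5, 5, 3, 3]
s_j = combined_severity(votes_a, votes_b)  # (8 + 4) / 2
pooled = mean(votes_a + votes_b)           # 40 / 8, shown for comparison
print(s_j, pooled)                         # 6.0 5.0

# Steps 6 and 7: interval centers and the input set for the clustering step.
centers = interval_centers(9, 61)
repetitions = [9, 14, 27, 61]              # hypothetical q_j values
Q = repetitions + centers
print(len(Q))                              # 14
```

The comparison between `s_j` and `pooled` shows why each opinion in the smaller group carries more weight when the two group means are averaged.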

Results
Before implementing the proposed algorithm, as mentioned in Sect. 2.1, a survey was conducted to define a fair exam. At the same time, participants were asked which aspects most seriously impair this definition, and 12 attributes were derived. It is worth noting that another customer of the process is the "educational system". To keep the content concise and the results focused, its needs are treated as embedded in the needs of the two groups mentioned. For example, the relevance of the exam content to the taught topics and appropriate references ensures that the training is in line with the objectives of the education system. Prevention of widespread cheating in the exam guarantees the validity of the training provided by the educational system, and clarifying the demands for exams follows the goals of the education system.
Then, the proposed algorithm was implemented in two consecutive semesters (spring 2020 and fall 2021). Eighty people, including 60 students, 8 professors, and 12 teaching assistants (20 people in group A and 60 people in group B), participated. Based on steps 1 and 2, a total of 33 potential failure modes (set F) were identified. For clarity, these failure modes are classified into the 12 attributes (Column 2), and the causes of each one (obtained through surveys) are given in the fourth column of the same table.
The severity number ranges between 1 and 10. Severity numbers above 7, marked by the dashed line in Fig. 4, are highly critical and must be treated regardless of their overall RPN. According to step 3 of Sect. 2.6 and Eq. (7), the severity averages are calculated separately for groups A and B and listed in columns $S_j^A$ and $S_j^B$ of Table 3, respectively. The average of these two values is placed in the third column ($S_j$). Because the two group means are weighted equally while the groups differ in size (20 people in group A and 60 people in group B), each individual opinion in group A has more influence on $S_j$ than an individual opinion in group B.
Then, for the detection number, the average of the detection numbers assigned to each failure mode by individuals in group A is calculated and reported in Table 4. The detection number ranges between 1 and 10. It is divided into three parts: the range 0-3 as easy and obvious detection, the range 3-7 as average, normal detection, and the range 7-10 as difficult detection. These ranges are separated by two dashed lines in Fig. 5.
According to Sect. 2.6, step 5, the number of repetitions of each failure mode is calculated and presented in Table 5, giving $q_{max} = 61$ and $q_{min} = 9$. Next, using Eq. (9), the center of each occurrence interval is calculated. Then, as mentioned in Sect. 2.6, step 7, we form the set Q from the values $q_j$ and the interval centers $q'_l$. The occurrence numbers are obtained for each failure mode by clustering Q into 10 categories with k-means(Q, 10) in MATLAB R2013. The cluster centers obtained from the k-means process are listed in Table 6. The occurrence numbers (O) are arranged in Table 7.
Then, using Eq. (1), $\mathrm{RRPN}_j = S_j \times D_j \times O_j$, the $\mathrm{RRPN}_j$ values are calculated as shown in Table 8. To modify the value of RRPN, the MRPN is determined using Eq. (2) by applying the fuzzy inference system "Risk". The results are shown in Table 9.
In the next step, a total of 41 possible solutions for the failure modes were extracted using a survey of people in both groups A and B. As mentioned in Sect. 2.6, step 10, similar and closely related solutions are merged conceptually and arranged in Table 10 as C1 to C41.
After listing the solutions, according to step 11 of Sect. 2.6, in an analogy to the engineering products, the QFD matrix was generated. Failure modes are given as the customer needs, and MRPNs are their priority. Suggested solutions are assumed as technical product characteristics (C r ). Then, using average values from groups A and B, the QFD matrix was completed. The weight (here, priority) and normalized weight of each solution were obtained by applying (13) and (14), respectively. The result is presented in Table 11. As mentioned in step 12, this is the final result of the proposed algorithm. Prioritized actions are listed in Table 12.

Discussion
According to Table 1, if the severity numbers are in the range of 7-10, they express the major effect of the failure mode on the end-user. Therefore, the number 7 is marked in the diagram with a dividing line as the "threshold". The highest severity numbers in the critical region (F11, F15, F10, F2, and F14) show that the most confusing and dissatisfying effect in an online test is related to credibility and fraud prevention. Also, if the scoring is not entirely consistent with a specific policy, it can cause severe adverse effects. On the other hand, designing test questions by someone other than the instructor can cause serious problems.
According to Tables 1 and 4 and Fig. 5, only one failure mode is within the difficult detection range, which is F14 (inconsistency in grading). It is quite logical that if the question designer (who should be the instructor himself/herself) does not provide a specific key for grading the exam answer scripts, it will not be easy to check the consistency of the results.
Considering the numbers in Table 6, which are derived from the proposed k-means system for determining occurrence numbers, and Fig. 6, which is a comparative graph of occurrence values, the failure modes that are most likely to occur (F13, F5, F19, F22, F27, and F28) do not have high severity. Therefore, it can be concluded that the online exams held so far are mainly at an acceptable level of customer satisfaction, and efforts should focus on improving the current level. Also, the presence of these failure modes in the high-probability list shows that the main sources of failure are the way the instructor teaches, how precisely expectations are expressed, and how well the allotted time matches the questions.
According to Table 9 and Fig. 7, the failure modes with the highest MRPN (the modified risk priority number) are F11, F10, F2, F15, and F9. They show that the main critical issue in an online exam is cheating, which can undermine the validity and fairness of the exam. Also, the presence of heterogeneity or an incorrect key can disrupt the whole result. On the other hand, if the questions are not from the taught topics, the test is invalid. In general, if cheating is prevented, the online exam can reasonably be accepted as appropriate.
To evaluate how well the fuzzy inference system provides a more appropriate criterion for comparing the criticality of failure modes, we study the cases whose priority changed most relative to the initial RPN. To do this, both RRPN and MRPN values are normalized. The normalization range here is 1-100, depending on the numbers available. From Table 12 and Fig. 8, the largest increases in priority occurred for F15, F14, and F17. These failure modes do not have very large RRPNs, but their severity values are high. Therefore, the fuzzy inference system has raised their priority, which indicates that the modifier system works as intended.
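A comparison of this kind can be sketched in Python as follows. The sketch assumes a min-max rescaling to the range 1-100, which is one plausible reading of the normalization mentioned above, and measures the change in priority as the difference between rank positions before and after the modification. The function names and all values are hypothetical.

```python
def rescale(values, low=1.0, high=100.0):
    """Min-max rescaling to [low, high] (ASSUMED normalization)."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [low for _ in values]
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]


def ranks(values):
    """Rank 1 is the largest value."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result


# Hypothetical RRPN and MRPN values for four failure modes.
rrpn = [120, 48, 300, 90]
mrpn = [55, 80, 85, 40]

rrpn_scaled = rescale(rrpn)
mrpn_scaled = rescale(mrpn)

# Positive shift: the failure mode moved up in priority after the modification.
shift = [before - after for before, after in zip(ranks(rrpn), ranks(mrpn))]
print(shift)  # [-1, 2, 0, -1]
```

In this toy data set the second failure mode has the smallest RRPN but gains two positions after the modification, which mirrors the behavior reported for high-severity failure modes.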
Time and cost constraints prevent us from implementing all corrective and preventive actions (C1 to C41 in Table 10). Therefore, we need to prioritize them. High-priority solutions are the actions that can prevent the more hazardous failure modes. Table 11 shows that the most important actions for maximizing customer (faculty, assistants, and students) satisfaction during an online test are C33, C19, C36, C16, and C9. Expressing the expectations of the test and the evaluation methods precisely, holding the exam with sufficient supervision at the right time and place, having the instructor design the exam questions and key, and providing appropriate infrastructure can prevent potential problems in an online test.
It is also emphasized that, since this method is based on FMEA, the provided solutions have a preventive aspect, leading to a reduction in adverse effects in emergencies such as the recent COVID-19 pandemic.
Using other decision-analysis methods, such as AHP, ANP, and DEMATEL, can be very helpful in analyzing the results further.

Conclusion
The COVID-19 pandemic and the need to adhere to health protocols, including avoiding crowded gatherings, have led to a sudden and growing demand for online college classes. The assessment process is one of the most critical and challenging parts of online education. In this study, a fair online exam is defined as a test that leads to customer satisfaction (including faculty, assistants, educational system, and students). Then, by analogy with an engineering product, the product design process is performed on it. As the first stage, the FMEA process, which is a preventive method for identifying potential failure modes, is employed to find the potential failure modes and their severity, occurrence, and preventive detection numbers. Then, the risk priority number of each case is calculated. The K-means method, which is an unsupervised clustering algorithm, has been used to resolve the inconsistencies in assigning the occurrence numbers. The results show that if the taught topics and exam titles are consistent, the instructor's expectations of the students are clear, there is a clear assessment policy, the test is held under adequate supervision at the right time and place and with the appropriate infrastructure, and the test questions are designed by the instructor him/herself, the maximum satisfaction of the stakeholders will be obtained. According to the provided definition, this will lead to an increase in the validity of the online test.

Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
The manuscript is original and has neither been published before in any form or language, partially or in full, nor been submitted to any other journal for simultaneous consideration.
Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.