Item analysis of university-wide multiple choice objective examinations: the experience of a Nigerian private university

Teachers and students worldwide often dance to the tune of tests and examinations. Assessments are powerful tools for catalyzing the achievement of educational goals, especially if done rightly. One of the tools for 'doing it rightly' is item analysis. The core objectives of this study, therefore, were to ascertain the item difficulty and distractive indices of the university-wide courses. Between 112 and 1956 undergraduate students sat each of the courses examined. With the use of secondary data, the ex-post facto design was adopted for this project. In virtually all cases, the majority of the items (ranging between 65% and 97% of the 70 items fielded in each course) did not meet the psychometric standard in terms of difficulty and distractive indices and consequently needed to be moderated or deleted. Considering the importance of these courses, the need to apply item analysis when developing these tests was emphasized.


Jonathan A. Odukoya (adedayo.odukoya@covenantuniversity.edu.ng) and Olajide Adekeye (olujide.adekeye@covenantuniversity.edu.ng)

Introduction
Multiple-choice objective test items are easy to score and analyze, but their development is often technical, time-consuming and at times painstaking. To cover a wide scheme of work or syllabus adequately, it is imperative that multiple-choice objective tests be used. When assessing a large population of students, the use of multiple-choice questions (MCQs) is the most logical option. The challenges, however, include: a tendency to write poor MCQs with ambiguous prompts, poor distractors, multiple answers when the question demands only one correct answer, controversial answers, give-away keys, and a higher probability of testees guessing correctly, to mention but a few of the challenges of developing and using MCQs. There is hardly any subject that cannot use MCQs. However, when assessments border on life-sensitive issues like health, air flight (and the like), MCQs should be applied with caution. The reality, however, is that virtually all assessment purposes are life-sensitive. The results of virtually all assessments are often used to make sensitive decisions that determine people's destiny. It is therefore imperative that MCQs be handled rightly at the development, administration, scoring, grading and interpretation stages. The focus of the study reported here is on the development stage of MCQs, with particular emphasis on item analysis.
The first critical step in developing valid MCQs is recruiting relevant subject experts with the requisite skill in writing MCQ items. The correct handling of this stage will go a long way in setting the pace for the establishment of the content validity of the test. However, the validity of MCQs cannot be completely ascertained with skillful item writing alone. Psychometric requirements demand that such items be trial tested, while the responses and scores generated are subjected to statistical item analysis. Ary et al. (2002) opined that item analysis involves the use of statistics that can provide relevant information for improving the quality and accuracy of multiple choice questions. There are three popular forms of item analysis: the item difficulty index, the distractive index and the discriminatory index.
The item difficulty index indicates the degree of difficulty of the MCQ items in relation to the cognitive ability of the testees (Boopathiraj and Chellamani 2013). It is calculated by finding the proportion of the testees who answered the item correctly. An item is adjudged too difficult when the index is below 0.3. An item is adjudged too easy when the index is above 0.7. Depending on the purpose of the test, the cut-off points for easy or difficult items can be adjusted upward or downward. Generally, the rule is that life-sensitive or competitive activities require more technical/difficult items in screening, while less sensitive activities or activities requiring motivation of testees often use less difficult items. For most summative assessments, such as those handled by the West African Examinations Council, moderate difficulty indices of around 0.5 are often preferred.
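The computation described above can be illustrated with a minimal Python sketch. The function names, the sample responses and the default cut-offs of 0.3 and 0.7 (taken from the thresholds quoted in this paragraph) are assumptions made for illustration; they are not the software actually used in the study.

```python
from typing import Sequence


def difficulty_index(responses: Sequence[str], key: str) -> float:
    """Proportion of testees who selected the keyed (correct) option."""
    if not responses:
        raise ValueError("responses must not be empty")
    correct = sum(1 for r in responses if r == key)
    return correct / len(responses)


def classify_difficulty(p: float, low: float = 0.3, high: float = 0.7) -> str:
    """Label an item using adjustable cut-offs (illustrative defaults)."""
    if p < low:
        return "too difficult"
    if p > high:
        return "too easy"
    return "acceptable"


# Hypothetical responses of ten testees to one item whose key is "A".
sample = list("ABACADABAA")
p = difficulty_index(sample, "A")
print(p, classify_difficulty(p))  # 0.6 acceptable
```

Because the cut-offs are parameters, the same sketch covers the case where a test developer shifts them upward or downward to suit the purpose of the test.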
It is important to note that an item may prove unduly difficult (that is, record a low difficulty index, since the index is the proportion of testees answering correctly) if the content of the item was not taught, the concept was not understood or the question was not properly worded. According to Suruchi and Rana (2015), the two purposes of item analysis are: firstly, to identify defective test items and, secondly, to indicate the areas that the learners have or have not mastered. This is actually the essence of item analysis: to check for flaws of this nature and find ways of correcting them before finally administering the questions (El-Uri and Malas 2013). Item moderation, therefore, naturally follows item analysis. Where an item cannot be moderated, it is often discarded and replaced.
The distractive index determines the power of the distractors (i.e. the incorrect options in an MCQ) in distracting the testees. The distractive index is computed in virtually the same way as the difficulty index. It is the proportion of testees who selected a distractor out of all the testees that sat for the test. When a distractor distracts few or no testees, it is concluded that it is a poor distractor and should be reviewed. When a distractor over-distracts, that is, distracts about the same proportion or a higher proportion of the testees than are selecting the key (i.e. the right option), such an option is also due for review or replacement.
Sabri (2013) submitted that the discriminatory index depicts the power of an item in discriminating between high and low performing testees. Item discrimination determines whether those who did well on the entire test did well on a particular item. An item should in fact be able to discriminate between upper and lower scoring groups. One way to determine an item's power to discriminate is to compare those who have done very well with those who have done very poorly, known as the extreme group method. First, identify the testees who scored in the top one-quarter (upper quartile) as well as those in the bottom one-quarter of the class (lowest quartile). Next, calculate the proportion in the upper and lower quartiles that answered a particular test item correctly. Finally, subtract the proportion of testees who got the item right in the bottom performing group from the proportion of testees in the top performing group who got the item right to obtain the item's discrimination index (D). Item discriminations of D = 0.50 or higher are considered excellent. D = 0 means the item has no discriminatory power, while D = 1.00 means the item has perfect discrimination power. It is therefore expected that more of the high performing testees should get an item right while few of the low performing testees should get the same item right.
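The extreme group method outlined above can be sketched in Python as follows. This is an illustrative sketch only: the function name, the sample scores and the handling of tied total scores (resolved by sort order) are assumptions, and the study itself did not report discrimination indices.

```python
from typing import Sequence


def discrimination_index(
    total_scores: Sequence[float],
    item_correct: Sequence[int],
    fraction: float = 0.25,
) -> float:
    """D = proportion correct in the top group minus that in the bottom group.

    total_scores[i] is testee i's score on the whole test;
    item_correct[i] is 1 if testee i answered this item correctly, else 0.
    """
    n = len(total_scores)
    if n == 0 or n != len(item_correct):
        raise ValueError("inputs must be non-empty and of equal length")
    k = max(1, int(n * fraction))  # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical data for eight testees: the two lowest scorers miss the item
# and the two highest scorers get it right.
scores = [10, 20, 30, 40, 50, 60, 70, 80]
item = [0, 0, 1, 0, 1, 1, 1, 1]
print(discrimination_index(scores, item))  # 1.0
```

A negative value of D corresponds to the situation in which weaker testees outperform stronger ones on the item, which signals that the item needs review.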
When more of the testees who generally perform poorly in a test tend to select the right option for an item while those who performed well select wrong options, then something is apparently wrong with such an item. It calls for the item to be reviewed or discarded. Thus, item analysis activities work to enhance the overall validity of a test. Kehoe (1995) observed that the basic idea that we can capitalize on is that the statistical behavior of "bad" items is fundamentally different from that of "good" items. This fact underscores the point of view that tests can be improved by maintaining and developing a pool of "good" items from which future tests can be drawn in part or in whole. This is particularly true for instructors who teach the same course more than once. Item analysis is a tool to help the item writer improve an item (Gochyyev and Sabers 2010).
Over the years, tertiary institutions have come to realize the significance of some life-enhancing concepts that should be learnt. It is this vital life-enhancing information that has been packaged as university-wide courses. Consequently, some universities have compulsory courses like General Studies, which covers use of languages and philosophical issues; Total Man Concept; Entrepreneurship Development Studies; Human Development etc. Some of these courses are zero unit but compulsory. The truth is that knowledge, especially applicable and relevant knowledge, is powerful and life transforming. It is therefore imperative to teach and assess these courses professionally for maximum impact. It is against the backdrop of these points that this study was undertaken.

Statement of problem
Inadvertent omission of item analysis in the process of developing Multiple Choice Questions (MCQs) for compulsory university-wide courses that solely use MCQs could jeopardize the integrity of assessment and certification. Incorrect application of item analysis results could have the same effect. Because these courses are compulsory, failure could translate to affected students spending an extra year on campus. This has implications for the psychological state of the students concerned. The emotional offshoot of failing and having to spend an extra year with one's juniors could translate to a number of debilitating medical, psychosomatic and psychological challenges. On the other hand, unprofessional assessment could lead to wrong award of grades and certificates. In the study of computer adaptive testing, Cechova et al. (2014) reiterated this point when they surmised: 'Every year, hundreds of secondary school students take university entrance exams, and their results determine entry into universities or possible alternatives, such as employment. In the same way, every year university teachers face the challenge of how to cope with the increasing number of examination candidates vis-à-vis maintaining the validity of the tests'.

Statement of significance
Professional conduct of item analysis, and the concomitant moderation of the items comprising the university-wide courses, is apt to enhance the overall validity of such tests. This in turn should significantly reduce frustrations for the individual and the society at large. Correct assessment, with the application of essential psychometric practices like item analysis, is likely to enhance the quality of assessment, evaluation and certification (IAR, 2011).

Statement of objectives
• Find out how appropriate the difficulty indices of the items comprising the university-wide courses are.
• Determine the appropriateness of the distractive indices of the options making up the items in the university-wide course MCQs.

Research questions
• How appropriate are the difficulty indices of the items comprising the university-wide courses?
• How appropriate are the distractive indices of the options making up the items in the university-wide course MCQs?

Method
The ex-post facto design was adopted for this study. Secondary data were collated and analyzed.
The population for this study comprised undergraduates of private universities in Nigeria. They were estimated at about one million at the time of this study.
The responses of over 1500 students who responded to the MCQs of the university-wide courses at various times were harvested and analyzed. Students' responses in the following courses were analyzed: EXX 121 (N = 1907; Test taken 2015); GXX 121 (N = 1956; Test taken 2015); HXX 421 (N = 112; Test taken 2015); TXX 121 (N = 1905; Test taken 2015). Note that the original course codes have been changed for anonymity. These courses were chosen for this study largely because they are compulsory for students' graduation, irrespective of program, and because of the relevance of the course content to overall wellbeing and success in life.
The core instruments for this study were the past MCQ items for four core university-wide courses.
The responses to past MCQs were harvested from the University's Data Centre. The major statistical analyses conducted were difficulty index and distractor index, using proportion and simple percentage. The formulas applied in this regard are:

$$\text{Distractive index} = \frac{\text{Number of times an option was selected}}{\text{Total number of respondents}}$$

For multiple choice questions with one correct answer format:

$$\text{Difficulty index} = \frac{\text{Number of respondents who selected the right option}}{\text{Total number of respondents}}$$

The following decision rules were applied to determine items that are Okay (OK), Fairly Okay (F/OK), Need Moderation (NM), and Need Serious Moderation (NSM): When the difficulty index is over 0.7 (i.e. 70%) or below 0.2 (i.e. 20%), such item is adjudged not okay and needs moderation. The difficulty index was computed with the proportion of Testees selecting the correct option as indicated by the bold figures in Table 1 below. When the distractive index for a distractor or incorrect option is far above or far below 0.166 (i.e. 16.6%), there is need for moderation. The rationale for this decision is that for a test that operates by the principle of moderate difficulty of 0.5, the remaining 0.5 should be fairly shared equally between the 3 distractors (for a 4-option item), which gives 0.166. Any item falling short of these two requirements is apt to require moderation.
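The two formulas and the decision rules above can be put together in a short Python sketch. The text does not quantify how far a distractive index must stray from 0.166 before moderation is needed, so the `tolerance` of 0.10 used here is purely an assumed value for illustration, as are the function names and sample responses. The sketch also does not attempt to reproduce the OK, F/OK, NM and NSM labels, whose exact boundaries are not stated.

```python
from collections import Counter
from typing import Dict, List, Sequence


def option_indices(responses: Sequence[str], options: Sequence[str]) -> Dict[str, float]:
    """Proportion of respondents selecting each option."""
    n = len(responses)
    if n == 0:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    return {opt: counts.get(opt, 0) / n for opt in options}


def review_item(
    responses: Sequence[str],
    key: str,
    options: Sequence[str] = ("A", "B", "C", "D"),
    tolerance: float = 0.10,  # assumed: "far above or far below" is not quantified
) -> dict:
    """Apply the difficulty rule (0.2 to 0.7) and the distractor rule."""
    idx = option_indices(responses, options)
    difficulty = idx[key]
    difficulty_ok = 0.2 <= difficulty <= 0.7
    # Ideal share per distractor: the remaining 0.5 split among the distractors.
    target = 0.5 / (len(options) - 1)
    flagged: List[str] = [
        opt for opt in options
        if opt != key and abs(idx[opt] - target) > tolerance
    ]
    return {
        "difficulty": difficulty,
        "flagged_distractors": flagged,
        "needs_moderation": (not difficulty_ok) or bool(flagged),
    }


# Hypothetical item answered by 20 testees, key "A".
balanced = ["A"] * 10 + ["B"] * 4 + ["C"] * 3 + ["D"] * 3
too_easy = ["A"] * 18 + ["B"] * 2
print(review_item(balanced, "A")["needs_moderation"])  # False
print(review_item(too_easy, "A")["flagged_distractors"])  # ['C', 'D']
```

In the second example the item is flagged twice over: its difficulty index of 0.9 exceeds 0.7, and options C and D attract no testees at all.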

Results
The item analysis results in Tables 1 and 2 show that the majority of the items (approximately 86% of the 70 items fielded) did not meet the psychometric standard (of appropriate difficulty and distractive index) and consequently need moderation.
The detailed table of results from which the summary in Table 3 was drawn is in "Appendix 1". The item analysis results in Table 3 show that a notable proportion of the items (approximately 66% of the 70 items fielded) did not meet the psychometric standard (of appropriate difficulty and distractive index) and consequently require moderation or deletion.
The detailed table of results from which the summary in Table 4 was drawn is in "Appendix 2". The item analysis results in Table 4 show that nearly all of the items (approximately 97.1% of the 70 items fielded) did not meet the psychometric standard (of appropriate difficulty and distractive index) and consequently need moderation. Only 2.9% of the items were fairly okay. The operational psychometric standard for this study is that at least 70% of the items should be okay while the remaining 30% could be fairly okay.
The detailed table of results from which the summary in Table 5 was drawn is in "Appendix 3". The item analysis results in Table 5 show that a large majority of the items (approximately 83% of the 70 items fielded) did not meet the psychometric standard (of appropriate difficulty and distractive index) and consequently need moderation.

Discussion
The core research questions for this study are: 'How appropriate are the difficulty indices of the items comprising the university-wide courses?'; and 'How appropriate are the distractive indices of the options making up the items in the university-wide course MCQs?' In virtually all cases, the majority of the items (ranging between 65% and 97% of the 70 items fielded in each course) did not meet the psychometric standard used in this study. These findings call for concern. Contrary to the core findings in this study, Bichi (2015) found that out of the 40 items in a test assessed, 12 (30%) items failed to meet the set criteria of item quality and therefore needed moderation while 28 items were judged to be 'good' items. This appears to be a better result, yet he further recommended that the tests used to assess science secondary school students' achievement should be subjected to item analysis to improve their quality.
In a related post-examination analysis of objective tests, Tavakol and Dennick (2011) reiterated that one of the key goals of assessment in medical education is the minimization of all errors influencing a test in order to produce an observed score which approaches a learner's 'true score', as reliably and validly as possible. This is actually the core objective of all empirical assessment worldwide. From the results obtained from the current study, therefore, it may be difficult to unequivocally conclude that the scores obtained by the students were their true scores and a true reflection of their ability. There is clearly a need to conduct further psychometric assessment of these and related courses to ascertain the veracity of these findings.

Recommendations and conclusion
This study sought to establish the appropriateness of the difficulty and distractive indices of four compulsory university-wide courses in a Nigerian private university. In virtually all cases, the majority of the items (ranging between 65% and 97% of the 70 items fielded in each course) did not meet the psychometric standard in terms of difficulty and distractive indices. On the strength of the findings made from this study, and based on recent submissions on this subject (as cited above), it is recommended that the development of all the university-wide courses employing the MCQ format should commence with the preparation of a test blueprint, followed by careful adherence to the rules for writing multiple-choice objective questions (MCQs). Thereafter all items should be trial tested, item analyzed and subjected to item moderation to enhance the overall content and construct validities. This exercise will require the input of subject and psychometric experts. The exercise should be part of statutory quality assurance procedures. Dogged adoption of this singular recommendation is apt to significantly enhance the quality of graduates and certification in higher institutions.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix 1
See Table 6. Bold values are the indices for the correct options, which also represent the 'difficulty indices'.